Tuesday 22 March 2022

What's new in StormCrawler 2.3

StormCrawler 2.3 was released yesterday. It contains relatively few changes compared to previous releases, but these include important bug fixes. We have also ported the existing ParseFilters to JSoupParseFilters, leading to some noticeable performance improvements and an exuberant tweet.


We also welcomed Richard Zowalla as a new committer on the project.

Here are the main changes.

Dependency upgrades

  • Elasticsearch 7.17.0 
  • Tika 2.3.0
  • Caffeine 2.9.3 

Core

  • Convert LinkParseFilter into a JSoupFilter (#944)
  • Rewrote LinkParseFilter + added XPathFilter + tests for JSOUPFilters (#953)
  • General Code Refactoring and Good Practices (#937)
  • Add unified way of initializing classes via string … (#943)
  • Changed order of emit outlinks and emit of parent url ... (#954)
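
For readers wondering what "initializing classes via string" (#943) can look like in general, here is a minimal sketch of the underlying technique using plain Java reflection. It is not StormCrawler's actual code: the names `ClassInitSketch`, `Configurable`, `UpperCaseFilter` and `newInstance` are hypothetical and exist only for this illustration.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ClassInitSketch {

    /** Hypothetical interface: any component that is configured from a map. */
    public interface Configurable {
        void configure(Map<String, Object> conf);
    }

    /** Hypothetical component, used only to demonstrate the helper below. */
    public static class UpperCaseFilter implements Configurable {
        private String prefix = "";

        @Override
        public void configure(Map<String, Object> conf) {
            Object p = conf.get("prefix");
            if (p != null) {
                prefix = p.toString();
            }
        }

        public String apply(String in) {
            return prefix + in.toUpperCase(Locale.ROOT);
        }
    }

    /**
     * Loads the class with the given name, checks that it is compatible with
     * the expected type and calls its no-argument constructor.
     */
    public static <T> T newInstance(String className, Class<T> expectedType) {
        try {
            Class<?> clazz = Class.forName(className);
            if (!expectedType.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(
                        className + " is not a " + expectedType.getName());
            }
            Constructor<?> ctor = clazz.getDeclaredConstructor();
            return expectedType.cast(ctor.newInstance());
        } catch (ReflectiveOperationException e) {
            // Covers missing classes, missing constructors and failed instantiation
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // In a real crawler the class name would typically come from a config file
        String name = UpperCaseFilter.class.getName();
        UpperCaseFilter filter = newInstance(name, UpperCaseFilter.class);

        Map<String, Object> conf = new HashMap<>();
        conf.put("prefix", "x-");
        filter.configure(conf);

        System.out.println(filter.apply("storm"));
    }
}
```

Having a single helper like this means the type check and the error handling live in one place, instead of being repeated wherever a class name is read from the configuration.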

Elasticsearch 

  • Enable compression (#941)
  • Enable _source for content index in ES archetype (#958)

URLFrontier

  • Spout does not reconnect to URLFrontier if an exception occurs (#956)

The next release will probably include a new module for Elasticsearch 8; see #945. If you have experience with the new Elasticsearch client library, your contribution would be very welcome.

Thank you to all users and contributors, in particular Felix Engl for his work on the code refactoring and Julian Alvarez for reporting and fixing the bug in #954.

Our users Gage Piracy have also been very generous in donating some of the customisations we wrote for them back to the project.

Happy crawling!
