So much for the "no-coding-Friday" but this is a bomb https://t.co/8VIyChtelK
— DigitalPebble (@digitalpebble) February 18, 2022
If the parsing bolt in your #StormCrawler topology is a bit slow, you should definitely have a look at this one.
(and a big thank you to https://t.co/UGUVdHl0W1)
We also welcomed Richard Zowalla as a new committer on the project.
Here are the main changes.
Dependency upgrades
- Elasticsearch 7.17.0
- Tika 2.3.0
- Caffeine 2.9.3
Core
- Convert LinkParseFilter into a JSoupFilter (#944)
- Rewrote LinkParseFilter + added XPathFilter + tests for JSOUPFilters (#953)
- General Code Refactoring and Good Practices (#937)
- Add unified way of initializing classes via string … (#943)
- Changed order of emit outlinks and emit of parent url ... (#954)
Elasticsearch
URLFrontier
- Spout does not reconnect to URLFrontier if an exception occurs (#956)
The next release will probably include a new module for Elasticsearch 8, see #945. If you have experience with the new Elasticsearch client library, your contributions would be very welcome.
Thank you to all users and contributors, in particular Felix Engl for his work on the code refactoring and Julian Alvarez for reporting and fixing the bug in #954.
Our users Gage Piracy have also been very generous in donating some of the customisations we wrote for them back to the project.
Happy crawling!