Friday 25 May 2018

What's new in StormCrawler 1.9


Dependency upgrades
Core
  • Crawl-delay in robots.txt should optionally not shrink the configured delay #549
  • Optimisation: faster extraction of META tags #553
  • CollectionMetric synchronized access to List #555
  • Configurable Robots Caches #557
  • JSOUPParserBolt: lazy DOM conversion #563
  • Purge internal queues of tuples which have already reached timeout #564
  • Added ParseFilter to convert single valued Metadata to multi-valued ones #571
  • Caching of redirected robots.txt may overwrite correct robots.txt rules, fixes #573
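
To illustrate the Crawl-delay change (#549), here is a minimal sketch of how a fetcher might choose between the configured delay and the one from robots.txt. The class, method and parameter names are hypothetical and are not StormCrawler's actual API or configuration keys; see the issue for the real option.

```java
// Sketch only: all names here are hypothetical, not StormCrawler's API.
class CrawlDelaySketch {

    /**
     * @param configuredDelayMs     delay set in the crawler configuration
     * @param robotsCrawlDelayMs    Crawl-delay from robots.txt, or -1 if absent
     * @param keepConfiguredMinimum if true, robots.txt cannot shrink the delay
     * @return the delay to apply between requests to the same host
     */
    static long effectiveDelayMs(long configuredDelayMs,
                                 long robotsCrawlDelayMs,
                                 boolean keepConfiguredMinimum) {
        // No Crawl-delay in robots.txt: fall back to the configured value.
        if (robotsCrawlDelayMs < 0) {
            return configuredDelayMs;
        }
        // Optionally ignore a Crawl-delay shorter than the configured delay.
        if (keepConfiguredMinimum && robotsCrawlDelayMs < configuredDelayMs) {
            return configuredDelayMs;
        }
        return robotsCrawlDelayMs;
    }

    public static void main(String[] args) {
        System.out.println(effectiveDelayMs(1000, 200, true));   // 1000
        System.out.println(effectiveDelayMs(1000, 200, false));  // 200
        System.out.println(effectiveDelayMs(1000, 5000, true));  // 5000
    }
}
```

A longer Crawl-delay is still honoured in both modes; only a shorter one is affected by the flag.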
WARC
  • WARCBolt to handle incorrect URIs gracefully #560
  • WARCRecordFormat use ByteBuffer instead of ByteArrayOutputStream #561
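
The idea behind swapping ByteArrayOutputStream for ByteBuffer (#561) can be sketched as follows: when the size of a record is known up front, a single buffer of exactly that size avoids the repeated growth and final copy of a stream. This is an illustration only, loosely modelled on a header-plus-payload record; `RecordBufferSketch` and `assemble` are hypothetical names and this is not the actual WARCRecordFormat code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch only: hypothetical names, not the StormCrawler implementation.
class RecordBufferSketch {

    private static final byte[] CRLF = {13, 10};

    /**
     * Assembles header, blank line, payload and a two-CRLF terminator.
     * The header is assumed to already end with CRLF.
     */
    static byte[] assemble(String header, byte[] payload) {
        byte[] headerBytes = header.getBytes(StandardCharsets.UTF_8);
        // Total size is known in advance, so allocate exactly once.
        int size = headerBytes.length + CRLF.length
                + payload.length + 2 * CRLF.length;
        ByteBuffer buffer = ByteBuffer.allocate(size);
        buffer.put(headerBytes).put(CRLF).put(payload).put(CRLF).put(CRLF);
        // The buffer is completely filled, so its backing array is the record.
        return buffer.array();
    }

    public static void main(String[] args) {
        byte[] record = assemble("Key: value\r\n",
                "body".getBytes(StandardCharsets.UTF_8));
        System.out.println(record.length);
    }
}
```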
Archetype
  • Uses flux-core 1.2.1 #559
  • Added FeedParser to archetype topology #551
  • Added .kml and .wmv to URL filters
SOLR
  • MetricsConsumer handles recursive values #554
Elasticsearch
  • MetricsConsumer handles recursive values #554
  • ES Indexer and Deletion Bolts to get index name from constructor #572
LanguageID
  • Added option to LanguageID to skip if metadata already set #570
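
The skip option for LanguageID (#570) amounts to a guard in front of the detector. Below is a minimal sketch, assuming metadata is a simple string map and the detector is passed in as a function; `LanguageSkipSketch`, `resolveLanguage` and the `skipIfSet` flag are hypothetical names, not the module's real API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch only: hypothetical names, not the StormCrawler LanguageID module.
class LanguageSkipSketch {

    /**
     * Returns the language stored under key if skipIfSet is true and a
     * value is present; otherwise runs the detector and stores its result.
     */
    static String resolveLanguage(Map<String, String> metadata,
                                  String key,
                                  String text,
                                  boolean skipIfSet,
                                  Function<String, String> detector) {
        String existing = metadata.get(key);
        if (skipIfSet && existing != null && !existing.isEmpty()) {
            // Language already known: avoid the cost of detection.
            return existing;
        }
        String detected = detector.apply(text);
        metadata.put(key, detected);
        return detected;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("lang", "fr");
        // The stand-in detector always answers "en"; it is not called here.
        System.out.println(
                resolveLanguage(metadata, "lang", "bonjour", true, t -> "en"));
    }
}
```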

As usual, we advise all users to move to this version as it fixes several bugs. Thanks to all contributors and users. Happy crawling!