I have just released StormCrawler 1.17. As the list below shows, it contains important bugfixes and improvements, so we recommend that all users upgrade to this version. If you apply it to an existing crawl, please check the breaking changes below first.
Dependency upgrades
Core
- Use regular expressions for custom number of threads per queue fetcher #788
- /!breaking!/ Prefix protocol metadata #789
- Basic authentication for OKHTTP #792
- Utility to debug / test parsefilters #794
- /!breaking!/ Remove deprecated methods and fields #791
- AdaptiveScheduler to set last-modified time in metadata #777 #812
- /bugfix/ _fetch.exception_ key should be removed from metadata if subsequent fetches are successful #813
- /bugfix/ SimpleFetcherBolt maxThrottleSleepMSec not deactivated #814
- /!breaking!/ Index pages with content="noindex,follow" meta tag #750
- Enable extension parsing for SitemapParser #749 #815
WARC
Elasticsearch
- /bugfix/ AggregationSpout error due to SimpleDateFormat not being thread safe #809
- /bugfix/ IndexerBolt issue causing ack failures #801
- Allow ES to connect over a proxy #787
Of the breaking changes above, #789 is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add
protocol.md.prefix: ""
to the configuration. Similarly, http.skip.robots has been renamed to http.robots.file.skip, so update that key if your configuration sets it.
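For an existing crawl, the two changes might look like the fragment below. This is only a sketch: the file name and the top-level config key are assumptions based on the usual StormCrawler configuration layout, and the value given for http.robots.file.skip is illustrative, so carry over whatever you previously had for http.skip.robots.

```yaml
# crawler-conf.yaml (file name assumed; use the file that holds your crawl configuration)
config:
  # Empty prefix, as recommended above for crawls started before 1.17 (#789)
  protocol.md.prefix: ""

  # Renamed from http.skip.robots; only needed if you set the old key.
  # The value here is illustrative, reuse your previous one.
  http.robots.file.skip: false
```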
Thanks to all contributors and users! Happy crawling!
PS: something equally exciting is coming next ;-)