Storm-Crawler 0.7 was released yesterday. This release fixes a number of bugs and brings numerous improvements, so we advise users to upgrade to it. Here are the main changes:
- AbstractIndexingBolt to use status stream in declareOutputFields #190 (see the sketch after this list)
- Change Status to ERROR when FETCH_ERROR above threshold #202
- FetcherBolt tracks cause of error in metadata
- Add default config file in resources #193
- FileSpout chokes on very large files #196
- Use Maven-Shade everywhere #199
- Ack tick tuples #194
- Removed PrinterBolt and IndexerBolt, added StdOutStatusUpdater #187
- Upgraded Tika to 1.11
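To give an idea of what the first change above looks like in practice, here is a minimal, hypothetical bolt declaring a separate "status" stream alongside its default stream. The class, field names and status value are purely illustrative and are not the actual AbstractIndexingBolt code.

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical indexing bolt: only illustrates declaring a dedicated "status" stream
public class DemoIndexingBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String url = tuple.getStringByField("url");
        Object metadata = tuple.getValueByField("metadata");
        // emit the document on the default stream for indexing
        collector.emit(tuple, new Values(url, metadata));
        // report its status on a separate stream, consumed by a status updater bolt
        collector.emit("status", tuple, new Values(url, metadata, "FETCHED"));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // default stream carrying the documents to index
        declarer.declare(new Fields("url", "metadata"));
        // dedicated "status" stream
        declarer.declareStream("status", new Fields("url", "metadata", "status"));
    }
}
```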
This release contains many improvements to the Elasticsearch module:
- Added README with a getting started section
- IndexerBolt uses url as doc ID (see the sketch after this list)
- ESSpout: maxSecSinceQueriedDate param to avoid deep paging
- ElasticSearchSpout can sort randomly, for a better diversity of URLs
- ElasticSearchSpout implements de/activate, counter for time spent querying, configurable result size
- Simple Kibana dashboards for metrics and status indices
- Metadata as structured object. Implements #197
- ES Spout: more metrics (acked, failed, ES queries and docs)
- ESSeedInjector topology
- Index init script uses TTL for metrics
- Upgraded ES version to 1.7.2
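As a small illustration of the doc ID change mentioned above, the sketch below (not the actual IndexerBolt code, written against the ES 1.7 Java client, with illustrative index and type names) indexes a document using its URL as the document ID, so that re-indexing the same URL overwrites the previous version instead of creating a duplicate:

```java
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.client.Client;

// Not the actual IndexerBolt code: a minimal sketch of indexing a page
// with its URL as the document ID, using the ES 1.7 Java client.
public class IndexWithUrlAsId {

    public static void index(Client client, String url, String text) {
        Map<String, Object> doc = new HashMap<String, Object>();
        doc.put("url", url);
        doc.put("content", text);

        // using the URL as the doc ID makes indexing idempotent:
        // re-crawling the same URL updates the existing document
        client.prepareIndex("index", "doc", url)
              .setSource(doc)
              .execute()
              .actionGet();
    }
}
```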
The SOLR module has also received some attention:
- solr-metadata #210
- Cleaned up some documentation and fixed typos
- Removed outdated configuration options for the SOLR module
These changes, combined with the Kibana dashboard templates, make it easy to monitor a crawl and get additional insights into its behaviour, as illustrated below.
Of course, thanks to Storm's pluggable and versatile metrics mechanism, it is relatively easy to send the metrics to other backends, such as AWS CloudWatch for instance.
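For instance, a custom IMetricsConsumer can be registered on the topology configuration to forward the data points wherever needed. The CloudWatch-specific class below is purely hypothetical and only sketches the shape of such a consumer, with the actual publishing code left out:

```java
import java.util.Collection;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.metric.api.IMetricsConsumer;
import backtype.storm.task.IErrorReporter;
import backtype.storm.task.TopologyContext;

// Hypothetical consumer forwarding Storm metrics to an external backend
// such as AWS CloudWatch; the publishing code itself is omitted.
public class CloudWatchMetricsConsumer implements IMetricsConsumer {

    @Override
    public void prepare(Map stormConf, Object registrationArgument,
            TopologyContext context, IErrorReporter errorReporter) {
        // set up the CloudWatch client here
    }

    @Override
    public void handleDataPoints(TaskInfo taskInfo, Collection<DataPoint> dataPoints) {
        for (DataPoint dp : dataPoints) {
            // dp.name / dp.value would be turned into CloudWatch metric data;
            // here we simply print them
            System.out.println(taskInfo.srcComponentId + " " + dp.name + " = " + dp.value);
        }
    }

    @Override
    public void cleanup() {
        // release the client
    }

    // registration on the topology config: one instance of the consumer will run
    public static void register(Config conf) {
        conf.registerMetricsConsumer(CloudWatchMetricsConsumer.class, 1);
    }
}
```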
Thanks to the various users and contributors who helped with this release.