DigitalPebble's Blog: What's new in Storm-Crawler 0.7

Storm-Crawler 0.7 was released yesterday. This release fixes some bugs and provides numerous improvements, so we advise users to upgrade to it. Here are the main changes:

AbstractIndexingBolt to use status stream in declareOutputFields #190

Change Status to ERROR when FETCH_ERROR above threshold #202

FetcherBolt tracks cause of error in metadata

Add default config file in resources #193

FileSpout chokes on very large files #196

Use Maven-Shade everywhere #199

Ack tick tuples #194

Remove PrinterBolt and IndexerBolt, added StdOutStatusUpdater #187

Upgraded Tika to 1.11
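To give an idea of what the FETCH_ERROR threshold change (#202) means in practice, here is a minimal sketch of the logic. It is not the actual Storm-Crawler code: the class name, the metadata key `fetch.error.count` and the `maxFetchErrors` parameter are all made up for illustration, and the real implementation may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class FetchErrorSketch {

    // Simplified set of statuses, for illustration only
    enum Status { DISCOVERED, FETCHED, FETCH_ERROR, ERROR }

    // Hypothetical metadata key, not taken from the Storm-Crawler source
    static final String ERROR_COUNT_KEY = "fetch.error.count";

    /**
     * Returns the status to persist for a URL. Every FETCH_ERROR increments
     * a counter kept in the metadata; once the counter goes above the
     * threshold the URL is marked as ERROR.
     */
    static Status nextStatus(Status reported, Map<String, String> metadata,
            int maxFetchErrors) {
        if (reported != Status.FETCH_ERROR) {
            return reported;
        }
        int count = Integer.parseInt(
                metadata.getOrDefault(ERROR_COUNT_KEY, "0")) + 1;
        metadata.put(ERROR_COUNT_KEY, Integer.toString(count));
        return count > maxFetchErrors ? Status.ERROR : Status.FETCH_ERROR;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        for (int i = 0; i < 3; i++) {
            System.out.println(nextStatus(Status.FETCH_ERROR, metadata, 2));
        }
    }
}
```

With a threshold of 2, the first two failures keep the URL in FETCH_ERROR and the third one switches it to ERROR, so a URL that keeps failing is eventually given up on.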

This release contains many improvements to the Elasticsearch module:

Added README with a getting started section

IndexerBolt uses url as doc ID

ESSpout : maxSecSinceQueriedDate param to avoid deep paging

ElasticSearchSpout can random sort -> better diversity of URLs

ElasticSearchSpout implements de/activate, counter for time spent querying, configurable result size

Simple Kibana dashboards for metrics and status indices

Metadata as structured object. Implements #197

ES Spout - more metrics acked, failed, es queries and docs

ESSeedInjector topology

Index init script uses ttl for metrics

Upgraded ES version to 1.7.2

The SOLR module has also received some attention:

solr-metadata #210

Cleaning some documentation and typo issues

Remove outdated configuration options for solr module

We also improved the metrics by adding a PerSecondReducer (#209), which is used by the FetcherBolts to provide pages-per-second and bytes-per-second metrics. The metric names and code were also improved, notably the gauges for ESSpout and FetcherBolt.
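The idea behind a per-second metric is simple enough to sketch in a few lines. The class below is not the PerSecondReducer from the release and does not use Storm's metrics API. It is a standalone illustration with hypothetical names, where the clock is passed in explicitly so that the behaviour is easy to check.

```java
public class PerSecondSketch {

    private long startMillis; // start of the current reporting window
    private double sum;       // total accumulated in the current window

    public PerSecondSketch(long nowMillis) {
        this.startMillis = nowMillis;
    }

    /** Adds a value to the current window, e.g. the bytes of a fetched page. */
    public void update(double value) {
        sum += value;
    }

    /** Returns the average per second over the window, then starts a new one. */
    public double getValueAndReset(long nowMillis) {
        double elapsedSec = (nowMillis - startMillis) / 1000.0;
        double rate = elapsedSec > 0 ? sum / elapsedSec : 0.0;
        sum = 0;
        startMillis = nowMillis;
        return rate;
    }

    public static void main(String[] args) {
        PerSecondSketch bytesPerSec = new PerSecondSketch(0L);
        bytesPerSec.update(500);
        bytesPerSec.update(1500);
        // 2000 bytes over a 10 second window
        System.out.println(bytesPerSec.getValueAndReset(10_000L));
    }
}
```

Dividing by the length of the reporting window is what makes the values comparable regardless of how often the metrics are reported.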

These changes, combined with the Kibana dashboard templates, make it easy to monitor a crawl and get additional insights into its behaviour, as illustrated below.

Of course, thanks to Storm's pluggable and versatile metrics mechanism, it is relatively easy to send metrics to other backends such as AWS CloudWatch.

Thanks to the various users and contributors who helped with this release.

Wednesday, 4 November 2015
