Friday, 23 March 2018

Grafana StormCrawler metrics v4


The Grafana dashboard for StormCrawler is a good starting point for monitoring the behaviour of your StormCrawler topology. This is typically used with Elasticsearch as a storage backend for the metrics generated by Storm but should work with any other Storm-compatible backend like Grafite or CloudWatch. 

Some of the metrics are specific to the components from the Elasticsearch module (spout, status, indexer) but you can simply remove or modify them if you use e.g. SOLR (NOTE: there was a feature request in Grafana to add SOLR as a datasource but to my knowledge, this is not yet available).

The latest version (4) brings the following changes.

  • URLs waiting in queues 

The recent 1.8 release of StormCrawler added a new metrics for the FetcherBolt which allows tracking the amount of time URLs spend in the internal queues. This has been added to the "URLs waiting in queues" panel alongside the average population of the queues.

Average time spent in queues + average queues population

  • ES StatusUpdater
Instead of tracking the number of bulk requests sent in the last minute, we now have a panel showing the evolution over time. This information is for the ES StatusUpdaterBolt only.

ES status updater bulk requests
  • Acked in StatusBolt
This is a brand new panel which is not specific to Elasticsearch but operates on any component with 'status' for id and shows the number of tuples acked over time, broken down by source.  

Tuples acked by StatusUpdater
In the graph above, we can see a peak early in the crawl where most of the tuples acked came from the sitemap bolt. Please note that the values are stacked in this graph. Sitemap files are typically discovered early in a crawl and generate a large number of discovered URLs; this is not the case later on when most tuples come from the HTML parser.
  • Robots panel
We removed the robots panel as the number of HTTP requests to robots files is shown in the "Fetcher: pages fetched" panel anyway and after the initial few minutes of a crawl, the panel simply indicated that the robots files were mostly cached.
  • ES Indexed 
This is a new panel showing the number of documents indexed into Elasticsearch as well as the documents filtered out during the indexing.



No comments:

Post a Comment

Note: only a member of this blog may post a comment.