DigitalPebble's Blog: May 2017

Monday, 29 May 2017

What's new in StormCrawler 1.5

StormCrawler 1.5 has just been released! It is an important road mark with the move to Elasticsearch 5.x and the implementation of long-awaited features such as the Selenium-based protocol. The code has been improved in many ways and despite the seemingly low number of lines below, this new release is a mammoth one!

The project, in general, is in very good health, with more and more organisations using it in production, and an increased visibility, reflected by the growing number of questions on StackOverflow.

Here are the main changes in 1.5.

CORE DEPENDENCIES UPGRADES

Apache Storm 1.1.0 (#450)

CORE MODULE

HTTP Protocol: implement cookie handling (#32)
java.util.zip.ZipException: Not in GZIP format thrown on redirs with httpclient (#455)
Selenium-based protocol implementation (#144) which I described in a separate blog post
Indicate whether RobotsRules come from cache or have been fetched (#460)
Memory issues when ByteArrayBuffer gets instantiated with a large value despite maxLength being set (#462)
FetcherBolt to dump URLs being fetched to log (#464)
Override sitemapsAutoDiscovery settings per URL (#469)

Knowing whether RobotsRules come from the cache gives us more insights into the behaviour of the crawlers as we can display the ratio of cache vs live (see illustration below)

as well as pages fetched vs robots fetched.

ELASTICSEARCH

Utility class to export URL and metadata from ES index to file (#444)
Fixed sampling with aggregation spout in ES5
Upgrade to Elasticsearch 5.3 (#221 and #451)
Optimise nextFetchDate to speed up queries to Elasticsearch (#429 and #452)
Delete gone pages from index (#253)
metrics - remove filtering (#281)

One of the main changes related to Elasticsearch is the removal of ElasticsearchSpout and the introduction of CollapsingSpout, which uses the brand new FieldCollapsing in Elasticsearch. We also fixed a concurrency issue in the StatusUpdaterBolt (9fefac8), improved the efficiency of the spouts by getting them to process results in a separate thread (1b0fb42), which combined with the optimisation of nextFetchDate (see above) and the fix of the sampling in AggregationSpout, means that the Elasticsearch module is more efficient than ever.

The move to Elasticsearch 5.x was not without difficulties but the result justifies the effort. I described in a separate post the common pitfalls of upgrading an existing topology to Elasticsearch 5.

Coming next?

As usual, it is hard to guess what the next release will be made of as the project is driven by its community.

Having said that, I'd expect the Selenium-based protocol to get improved as users start to use it. It is also likely that we'll move away from Apache HttpClient library (#443). As mentioned in the previous release, we'll probably upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser.

In the meantime and as usual, thanks to all contributors and users and happy crawling!

Monday, 15 May 2017

Avoid these common pitfalls when upgrading StormCrawler with Elasticsearch 5.x

The next (and probably imminent) release of StormCrawler will contain an update of Elasticsearch to version 5.3. This is definitely a good thing, as we want to keep up with the latest versions of Elasticsearch but has a few pitfalls when upgrading your existing application. Some of the changes are documented in the README but I will reiterate them here, just in case.

LOG4J dependencies

ES5 requires an upgrade in the logging dependencies of Apache Storm. You can update the dependencies in your existing Storm cluster by hand but since my patch is part of Storm 1.1.0, you should probably upgrade Storm altogether. StormCrawler 1.5 will depend on Storm 1.1.0 (but probably works with older versions as well).

Maven Shade Configuration

The pom file of your StormCrawler-based project needs modifying as well, you'll need to specify the Maven Shade Configuration and include:

<manifestEntries>
 <Change></Change>
 <Build-Date></Build-Date>
</manifestEntries>

See https://github.com/elastic/elasticsearch/issues/21627; this wasn't an issue with the previous versions of Elasticsearch.

Update es-conf.yaml

In particular, the value of es.status.bucket.field used to be _routing, which is an automatically generated field, however this is not available for the spouts anymore. Instead, use the same value as es.status.routing.fieldname e.g. metadata.hostname.

Mapping

ES5 should be able to read your existing indices, however, if you create a new set of indices from scratch, make sure you use the latest version of the script.

I hope this will help you for a successful upgrade, I will cover the new functionalities and improvements coming with StormCrawler 1.5 when it is released.

Happy crawling