Friday 5 June 2015

What's new in Storm-Crawler 0.5



We've just released version 0.5 of Storm-Crawler, just over three months after the previous one. As you can read below, we've been pretty busy! The project got some great contributions from new users and is seeing an increase in adoption, which is very encouraging.

Metadata and Outlinks


One of the main improvements in the new release is the introduction of a Metadata object, which replaces the Map<String,String[]> objects that were used everywhere in our code, as well as the KeyValues utility class that manipulated such Maps. This makes the code a lot simpler and more elegant.
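To give an idea of what this looks like in practice, here is a minimal sketch of the Metadata API as we understand it in 0.5 (method names might differ slightly in your version):

    import com.digitalpebble.storm.crawler.Metadata;

    public class MetadataExample {
        public static void main(String[] args) {
            // A Metadata instance replaces the raw Map<String,String[]>
            Metadata md = new Metadata();
            md.setValue("depth", "1");        // single value for a key
            md.addValue("keyword", "storm");  // several values under one key
            md.addValue("keyword", "crawler");

            System.out.println(md.getFirstValue("depth")); // prints 1
            // getValues returns all the values stored under a key
            for (String kw : md.getValues("keyword")) {
                System.out.println(kw);
            }
        }
    }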

A new MetadataTransfer class has been added to (a) determine what metadata should be kept, e.g. when persisting the information about a URL in a StatusUpdaterBolt, and (b) determine what metadata should be transferred from the source document to its outlinks. This is a very useful feature that gets used quite often in practice.
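For illustration, the keys to transfer or persist can be declared in the configuration. Here is a minimal sketch, where the configuration key names ("metadata.transfer", "metadata.persist") and the metadata keys ("seed.id", "depth") are assumptions made for the sake of the example:

    import java.util.Arrays;

    import backtype.storm.Config;

    public class MetadataTransferConfig {
        public static void main(String[] args) {
            Config conf = new Config();
            // Metadata keys copied from a page to its outlinks
            // ("metadata.transfer" and "seed.id" are assumed names)
            conf.put("metadata.transfer", Arrays.asList("seed.id"));
            // Metadata keys kept when persisting the status of a URL
            // ("metadata.persist" and "depth" are assumed names)
            conf.put("metadata.persist", Arrays.asList("depth"));
        }
    }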

Speaking of outlinks, they now have a proper Outlink class, which holds the anchor and metadata for a given target URL. Note that the parser bolts populate that metadata using the MetadataTransfer class described above before passing the outlinks to the ParseFilters. This means that a given ParseFilter can modify the outlinks for a given page or create completely new ones.
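As a hypothetical illustration, this is how an Outlink could be built and decorated with metadata, e.g. from within a custom ParseFilter (class and method names based on our reading of the 0.5 API, so double-check them against your version):

    import com.digitalpebble.storm.crawler.Metadata;
    import com.digitalpebble.storm.crawler.parse.Outlink;

    public class OutlinkExample {
        public static void main(String[] args) {
            // An Outlink carries the target URL, its anchor text
            // and a Metadata object of its own
            Outlink link = new Outlink("http://example.com/about");
            link.setAnchor("About us");

            Metadata md = new Metadata();
            md.setValue("depth", "2");
            link.setMetadata(md);

            System.out.println(link.getTargetURL() + " [" + link.getAnchor() + "]");
        }
    }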


JSoupParserBolt

We got a present from our committer Gui, whose company has kindly donated a parsing bolt based on JSoup. This is now the one we use by default; the one based on Tika has been moved to the external part of the code. If you are crawling non-HTML pages then you should use the Tika-based parser; otherwise the JSoup one is a lot lighter (both in code and dependencies) and works better for extracting data with XPath.
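Wiring it into a topology is straightforward. Here is a minimal sketch, where the component id "fetch" for the upstream fetcher bolt is just an assumption:

    import backtype.storm.topology.TopologyBuilder;

    import com.digitalpebble.storm.crawler.bolt.JSoupParserBolt;

    public class ParserTopologyExample {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            // ... spout and fetcher bolt declared earlier under the
            // hypothetical component id "fetch" ...
            builder.setBolt("parse", new JSoupParserBolt())
                   .localOrShuffleGrouping("fetch");
        }
    }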

Abstract classes for persistence

We also added many useful resources for writing recursive crawlers, in addition to the status stream that came with the previous release. These can be found in the com.digitalpebble.storm.crawler.persistence package.

In particular, we added a new AbstractStatusUpdaterBolt class. As the name suggests, this is meant to be extended to store the tuples coming from the status stream into some sort of storage (e.g. Elasticsearch, SOLR, Cassandra, HBase, etc.). The abstract class keeps an internal cache of newly discovered URLs so that the same URL does not get updated more than once in the backend. Obviously this cache would not outlive the bolt if it died, so it should be seen merely as an optimization and not as a 100% reliable filter. The abstract class then calls a Scheduler, a pluggable mechanism which defines when a given URL should next be fetched based on its metadata and status. The default scheduler simply relies on the configuration set by the user.
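As a sketch, a custom backend could look like the class below. Note that the exact signature of store() is assumed from the 0.5 code, so verify it against your version:

    import java.util.Date;

    import com.digitalpebble.storm.crawler.Metadata;
    import com.digitalpebble.storm.crawler.persistence.AbstractStatusUpdaterBolt;
    import com.digitalpebble.storm.crawler.persistence.Status;

    // A minimal, hypothetical status updater which just logs the updates;
    // a real implementation would write to Elasticsearch, SOLR, HBase, etc.
    public class LoggingStatusUpdaterBolt extends AbstractStatusUpdaterBolt {
        @Override
        public void store(String url, Status status, Metadata metadata, Date nextFetch) {
            // Called only for tuples which passed the internal cache,
            // with the next fetch date provided by the Scheduler
            System.out.println(status + "\t" + url + "\t" + nextFetch);
        }
    }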

We also added a new AbstractIndexerBolt class, which allows users to specify via the configuration which metadata to index. This greatly simplifies writing an IndexerBolt.
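For example, the mapping from metadata keys to index fields could be declared along the following lines; the configuration key "indexer.md.mapping" and the mapping values are assumptions made for the sake of the example:

    import java.util.Arrays;

    import backtype.storm.Config;

    public class IndexerConfigExample {
        public static void main(String[] args) {
            Config conf = new Config();
            // Index the metadata value stored under "parse.title" as the
            // document field "title", and "parse.keywords" as "keywords"
            conf.put("indexer.md.mapping",
                    Arrays.asList("parse.title=title", "parse.keywords=keywords"));
        }
    }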



Elasticsearch

These new classes above have been used for our Elasticsearch bolts and spout. We now have:
- a spout which pulls from Elasticsearch the URLs due for fetching
- a StatusUpdaterBolt which persists the status of the URLs
- an IndexerBolt which indexes the parsed documents
These 3 components make it possible to build a recursive crawler with Elasticsearch. We also added an example topology to illustrate how to do this, as well as an init script which defines the schemas of the indices.
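To give an idea, such a topology could be wired up roughly as follows. The Elasticsearch class names below are assumptions based on the module layout, so refer to the example topology for the definitive version:

    import backtype.storm.topology.TopologyBuilder;

    import com.digitalpebble.storm.crawler.Constants;
    import com.digitalpebble.storm.crawler.bolt.FetcherBolt;
    import com.digitalpebble.storm.crawler.bolt.JSoupParserBolt;
    // the three Elasticsearch components (assumed class names)
    import com.digitalpebble.storm.crawler.elasticsearch.bolt.IndexerBolt;
    import com.digitalpebble.storm.crawler.elasticsearch.persistence.ElasticSearchSpout;
    import com.digitalpebble.storm.crawler.elasticsearch.persistence.StatusUpdaterBolt;

    public class ESCrawlTopologySketch {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new ElasticSearchSpout());
            builder.setBolt("fetch", new FetcherBolt())
                   .shuffleGrouping("spout");
            builder.setBolt("parse", new JSoupParserBolt())
                   .localOrShuffleGrouping("fetch");
            builder.setBolt("index", new IndexerBolt())
                   .localOrShuffleGrouping("parse");
            // status updates (discovered outlinks, fetch errors, etc.) travel
            // on a dedicated stream and get persisted back to Elasticsearch
            builder.setBolt("status", new StatusUpdaterBolt())
                   .localOrShuffleGrouping("fetch", Constants.StatusStreamName)
                   .localOrShuffleGrouping("parse", Constants.StatusStreamName);
        }
    }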

As a bonus, we wrote a MetricsConsumer which can be plugged into the Storm metrics mechanism so that the metrics get indexed in Elasticsearch. The typical use would be to monitor the performance of the crawler in Kibana, based on the metrics generated by the spouts and bolts, e.g. bytes per second, pages fetched, etc. I had suggested this to the elasticsearch-hadoop community but it hasn't attracted much interest so far.
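Hooking it up only requires registering it with the Storm configuration. A minimal sketch, assuming the class lives in the metrics package of the elasticsearch module:

    import backtype.storm.Config;

    // assumed location of the consumer within the elasticsearch module
    import com.digitalpebble.storm.crawler.elasticsearch.metrics.MetricsConsumer;

    public class MetricsConfigExample {
        public static void main(String[] args) {
            Config conf = new Config();
            // Register the consumer so that Storm pushes the metrics
            // generated by the spouts and bolts to Elasticsearch
            conf.registerMetricsConsumer(MetricsConsumer.class);
        }
    }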

We will probably provide a schema file for Kibana so that users can load a standard dashboard for displaying the metrics. We just need to wait for the next release of Kibana, which will contain #1552.

Miscellaneous and next steps

We've replaced the old http protocol implementation we'd borrowed from Nutch with a brand new one based on Apache HttpClient. This means less code to maintain, and the new implementation is also more robust, particularly on https pages.


Apart from that, we improved our wiki pages, upgraded some dependencies (Tika to 1.8, ES to 1.5.1, Storm to 0.9.4), added some resources (e.g. MaxDepthFilter), removed some deprecated ones (#126) and fixed numerous bugs.



As I said, we've been pretty busy, and it looks like this is set to continue with the 0.6 release, which will probably contain #117 as well as resources for Apache SOLR.

Thanks to everyone who contributed to this release in any way.