Friday 4 September 2015

What's new in Storm-Crawler 0.6

We have just released version 0.6 of Storm-Crawler, an open source web crawling SDK based on Apache Storm. Storm-Crawler provides resources for building scalable, low-latency web crawlers and is used in production at various companies.

We have added loads of improvements and bug fixes since our previous release last June, thanks to the efforts of the community. The activity around the project has been very steady and a new committer (Jorge Luis Betancourt) has joined our ranks. We also had contributions from various users, which is great.

Here are the main features of version 0.6.

Dependency upgrades

  • Storm 0.9.5
  • crawler-commons 0.6
  • Tika 1.10

Code reorganisation

  • Organise external content as separate sub-modules #145
  • Removed external/metrics #160

API changes

  • ParseFilter from interface to abstract class #159
  • Parse can output more than one document #135
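
As a rough illustration of what the move from interface to abstract class can look like for people writing their own filters, here is a minimal sketch. The names SketchParseFilter and TextLengthFilter, the method signatures and the "text.length" key are all hypothetical and are not the actual Storm-Crawler API; the idea that the base class carries no-op defaults is also an assumption made for the example. See #159 for the real change.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseFilterSketch {

    /** Hypothetical stand-in for a filter base class; NOT the real Storm-Crawler signature. */
    public abstract static class SketchParseFilter {

        /** Optional hook with a no-op default: subclasses override it only if they need configuration. */
        public void configure(Map<String, String> params) {
            // nothing to do by default
        }

        /** The one method every filter has to implement: enrich the metadata of a parsed page. */
        public abstract void filter(String url, String text, Map<String, String> metadata);
    }

    /** Example filter which records the length of the extracted text. */
    public static class TextLengthFilter extends SketchParseFilter {
        @Override
        public void filter(String url, String text, Map<String, String> metadata) {
            metadata.put("text.length", String.valueOf(text.length()));
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        SketchParseFilter filter = new TextLengthFilter();
        filter.configure(new HashMap<String, String>()); // inherited no-op
        filter.filter("http://example.com/", "hello", metadata);
        System.out.println(metadata.get("text.length")); // prints 5
    }
}
```

With an abstract class, a filter like TextLengthFilter only has to implement the method it cares about and inherits the rest.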

New features and resources

  • SimpleFetcherBolt enforces politeness #181
  • New RobotsURLFilter #178
  • New ContentFilter to restrict text of document to XPath match #150
  • Support for using the canonical URL in the IndexerBolts #161
  • Improvement to SitemapParserBolt #143
  • Enforce robots meta instructions #148
  • Expand XPathFilter to accept a list of expressions as an argument #153
  • JSoupParserBolt does a basic check of the content type #151
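
To give an idea of what accepting a list of XPath expressions (#153) means in practice, the sketch below evaluates several expressions against the same document and collects every non-empty match. It uses only the standard javax.xml.xpath API from the JDK; the class MultiXPathSketch and its methods are made up for this post and do not reflect how XPathFilter itself is implemented or configured.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MultiXPathSketch {

    /** Parses a well-formed XML/XHTML string into a DOM document. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates each expression in order and collects the trimmed, non-empty text of all matches. */
    public static List<String> extract(Document doc, List<String> expressions) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> values = new ArrayList<>();
        for (String expression : expressions) {
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    values.add(text);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><head><title>Hello</title></head><body><h1> World </h1></body></html>");
        // The third expression matches nothing and simply contributes no value.
        List<String> values = extract(doc, Arrays.asList("//title", "//h1", "//h2"));
        System.out.println(values); // prints [Hello, World]
    }
}
```

Passing several expressions for one field is handy when sites mark up the same information in different ways, since an expression that matches nothing is simply skipped.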

External resources

The external (non-core) resources have been separated into discrete sub-modules as their number was getting larger. 

Our brand new module for Apache Solr (see #152) is comparable to the existing ElasticSearch one and provides an IndexerBolt, a MetricsConsumer, a SOLRSpout and a StatusUpdaterBolt.

Not all web crawls require scalable big data solutions. I conducted a survey of Apache Nutch users some time ago which showed that most people used it on a single machine and with fewer than a million URLs. These are often people crawling a single website. With that in mind, we added spout and StatusUpdaterBolt implementations which use MySQL as storage for the URL status, which is useful for small recursive crawls. See #172 for details.

AWS CloudSearch
There is also a new AWS module containing an IndexerBolt for Amazon CloudSearch (see #174). 

We hope that people find these improvements useful and would like to thank all users and contributors.
