Tuesday 12 July 2016

StormCrawler : the Coming of Age

I am very happy to announce the release of StormCrawler 1.0. It has taken a few years (and more specifically 791 commits from 15 contributors and 10 releases) to evolve from what was just an intuition to a piece of software which is now mature, stable and used in production by various companies.

The major release number reflects the version of Apache Storm, as we switched from Storm 0.10 to 1.0, however our minor number will not necessarily track the one used in Storm. The move to 1.0 also reflects the maturity of StormCrawler.

The main changes compared to the previous release are :
  • Moved to Storm 1.x (#295)
  • Upgrade to Java 8 (#308)
  • Renamed packages storm.crawler into stormcrawler (#306)
  • Added Flux equivalent to the example topology class (#286)
  • FetcherBolt simplify access to OutputCollector (#278)
  • JSoupParser detects mimetype with Tika #303
  • Elasticsearch : remote TTL from metrics index (#296)
  • Elasticsearch : sampler aggregation spout (#305)
  • Tika : Provide clues to Tika parser for indentification of mimetype (#302)
  • Use metadata keys last-modified and etag (#109)
  • URLFilter based on metadata (#312)
plus several minor changes and bug fixes.

Let's have a closer look at some of the changes above.

Flux

Flux is a very elegant resource for defining and deploying topologies on Apache Storm. The simple crawl topology generated by the archetype now contains a Flux equivalent of the Java topology class. This means that you don't need to know Java to define a topology but also that you don't need to recompile the jar every time you make a small change to the topology.

After calling 'mvn clean package' , you can start the topology in local mode with  

storm jar target/<INSERTJARNAMEHERE>.jar  org.apache.storm.flux.Flux --local crawler.flux

Sampler aggregation spout 

We added a new type of spout to the Elasticsearch module which uses the sampler aggregation - a new feature in Elasticsearch 2.x. This spout is useful for cases where the status index is very large as it reduces the time taken by the queries while preserving the diversity of URLs.

URL filter based on metadata

A new configurable URL filter based on metadata had been added and is included in the default topology generated by the archetype. This filter is ridiculously simple : it removes any outlinks based on the metadata of the source document. Imagine for instance that we get URLs from sitemaps files for a given site. We could decide not to follow the outlinks found in the leafs documents from the sitemap, which is a reasonable thing to do : if a site tells you what to index, there is a possibility that you'd only get noise / variants / duplicates by following the outlinks. Since leaf documents get the feature isSitemap with the value of false, we can configure the URL filter as follows :

{ "class": "com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter",
"name": "MetadataFilter",
"params": {
"isSitemap": "false"
}
}
This mechanism can be used for other things of course.

What next?

The next release will probably contain code and resources for fetching with the Selenium protocol or JbrowserDriver (#144). We might also improve the WARC related code. As usual the project evolves with the needs and contributions of the community.

Thanks

Since StormCrawler just passed a major milestone, it is a good time to thank all the committers, contributors past and present and users for helping make the project what it is today. I've had some very positive feedback recently from new users and I hope some of you will take the time to share their experiences with the rest of the community.

Happy crawling!

Julien