We've recently released version 0.4 of storm-crawler, which is a collection of resources for building low-latency, large-scale web crawlers with Apache Storm.
The project has been really active in the last few months, thanks partly to our two fantastic new committers (Jake Dodd and Gui Forget), and as a result this release contains some important changes and improvements.
Reorganisation of the code
We've split the project into two modules named 'core' and 'external'. External contains resources that are either specific to a given library, for instance the ElasticSearchBolt that can be used to index documents with ElasticSearch, or very generic, like our metrics-related code. This simplifies the code and dependencies for the core components and makes the project easier to understand.
There are also external resources contributed by third parties, as well as a separate project (still in its infancy) which will illustrate the use of storm-crawler and provide a ready-to-use generic web crawler, whereas storm-crawler itself will remain an SDK.
We also generate a test jar and dependencies for the core module, containing code that can be reused for testing various resources.
The main components of the SDK now send tuples not only to the standard stream but also to a separate 'status' stream, which is meant to be consumed by a bespoke bolt in charge of persisting the status and metadata for the known URLs of a crawl. This is useful for recursive crawls, where new URLs are discovered during the lifetime of the topology, but also for non-recursive ones, e.g. for managing redirections, errors, etc.
This is used by components such as the FetcherBolt (redirections), the ParserBolt (outlinks) or the brand new SiteMapParserBolt (outlinks - see below), in particular to handle errors, be they temporary or not. The component in charge of storing the status of the URL can then decide when a URL should be refetched or change its status, which is a better approach than failing the URL and simplifies the code for the Spouts.
The default stream is used primarily for the main content of a URL when it is successfully fetched and parsed, typically to send it to an index on ElasticSearch or Solr (or anything else you fancy), whereas the information about the URLs (think of the crawldb if you come from Apache Nutch) can be stored somewhere else, like HBase or Cassandra.
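To make the split between the two streams more concrete, here is a rough sketch in plain Java. It deliberately does not use the Storm API: `StreamCollector`, `FetchRouter`, the `Status` values and the tab-separated payload are all hypothetical names made up for illustration, and the decision to also report a successful fetch on the status stream is an assumption rather than a description of what the actual bolts do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical status values; the set used by storm-crawler may differ.
enum Status { DISCOVERED, FETCHED, REDIRECTION, FETCH_ERROR, ERROR }

// Hypothetical stand-in for a Storm output collector:
// it simply records what is emitted on each named stream.
class StreamCollector {
    static final String DEFAULT_STREAM = "default";
    static final String STATUS_STREAM = "status";

    private final Map<String, List<String>> emitted = new HashMap<>();

    void emit(String stream, String payload) {
        emitted.computeIfAbsent(stream, k -> new ArrayList<>()).add(payload);
    }

    List<String> on(String stream) {
        return emitted.getOrDefault(stream, Collections.emptyList());
    }
}

class FetchRouter {
    // Decides which stream(s) receive a tuple for a fetched URL.
    // Nothing is "failed": every outcome becomes a status update that a
    // downstream status-persisting component can act upon.
    static void route(StreamCollector out, String url, int httpCode, String redirectTarget) {
        if (httpCode >= 200 && httpCode < 300) {
            // Content goes to the default stream (e.g. towards an indexer).
            out.emit(StreamCollector.DEFAULT_STREAM, url);
            // Assumption: success is also reported on the status stream.
            out.emit(StreamCollector.STATUS_STREAM, url + "\t" + Status.FETCHED);
        } else if (httpCode >= 300 && httpCode < 400 && redirectTarget != null) {
            out.emit(StreamCollector.STATUS_STREAM, url + "\t" + Status.REDIRECTION);
            // The redirection target is a newly discovered URL.
            out.emit(StreamCollector.STATUS_STREAM, redirectTarget + "\t" + Status.DISCOVERED);
        } else if (httpCode >= 500) {
            // Treated here as a temporary error; the status store decides when to refetch.
            out.emit(StreamCollector.STATUS_STREAM, url + "\t" + Status.FETCH_ERROR);
        } else {
            // Everything else is treated here as a permanent error.
            out.emit(StreamCollector.STATUS_STREAM, url + "\t" + Status.ERROR);
        }
    }
}
```

In a real topology the two streams would be declared by the bolt and consumed by different downstream bolts; the point of the sketch is only that errors and redirections become status updates instead of failed tuples.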
We made some of the interfaces a bit richer. The Protocol interface can now receive the metadata associated with a URL. The ParseFilters can be configured with the Storm config and the URLFilter interface has access to the source URL and its metadata, which is useful for instance to filter based on the host or domain name of the source URL (see below).
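As an illustration of why access to the source URL matters, here is a minimal sketch of a filter that only keeps outlinks on the same host as the page they were found on. The interface name `SourceAwareURLFilter`, its method signature and the `SameHostFilter` class are hypothetical and written for this example only; they are not the actual storm-crawler interfaces.

```java
import java.net.URI;
import java.util.Map;

// Hypothetical interface: returns the URL to keep it, or null to discard it.
interface SourceAwareURLFilter {
    String filter(String sourceUrl, Map<String, String> sourceMetadata, String urlToFilter);
}

// Keeps a URL only if it is on the same host as the page it was found on.
class SameHostFilter implements SourceAwareURLFilter {

    @Override
    public String filter(String sourceUrl, Map<String, String> sourceMetadata, String urlToFilter) {
        // The metadata is not needed for this particular filter,
        // but other filters could base their decision on it.
        String sourceHost = hostOf(sourceUrl);
        String targetHost = hostOf(urlToFilter);
        if (sourceHost == null || targetHost == null) {
            return null; // unparsable or host-less URLs are discarded
        }
        return sourceHost.equalsIgnoreCase(targetHost) ? urlToFilter : null;
    }

    // Returns the host part of a URL, or null if it cannot be determined.
    private static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

Filtering on the domain rather than the host would need extra logic to work out the registered domain, which is left out here.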
Apart from the usual upgrades of dependencies, we've also added the following resources:
- RegexURLNormalizer: in storm-crawler this is implemented as a URLFilter
- HostURLFilter: filters URLs based on the host or domain of the source URL
- XPathFilter: extracts metadata from pages using XPath expressions
- SiteMapParserBolt: uses crawler-commons to extract outlinks from sitemap URLs, see documentation
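To give an idea of what extracting metadata with XPath expressions looks like, here is a small sketch based only on the XPath support that ships with the JDK. The class `XPathMetadataExtractor` and its key-to-expression map are hypothetical and do not reflect how the XPathFilter is configured or implemented; the sketch also assumes the page is well-formed XML, whereas real-world HTML would first have to go through an HTML parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates a set of XPath expressions against a page
// and returns the non-empty results as key/value metadata.
class XPathMetadataExtractor {

    // Maps a metadata key to the XPath expression producing its value.
    private final Map<String, String> expressions;

    XPathMetadataExtractor(Map<String, String> expressions) {
        this.expressions = expressions;
    }

    Map<String, String> extract(String wellFormedPage) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedPage)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> metadata = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : expressions.entrySet()) {
                // evaluate() returns the string value of the first matching node,
                // or an empty string when nothing matches.
                String value = xpath.evaluate(e.getValue(), doc).trim();
                if (!value.isEmpty()) {
                    metadata.put(e.getKey(), value);
                }
            }
            return metadata;
        } catch (Exception ex) {
            // Parsing or XPath errors are wrapped to keep the sketch simple.
            throw new IllegalArgumentException("Could not extract metadata", ex);
        }
    }
}
```

Expressions that match nothing simply produce no entry, so a page missing one of the expected elements still yields the rest of its metadata.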
This release also contains several bug fixes and various other improvements.
The next release should introduce a Metadata object to replace the Map<String,String> instances that are used everywhere in our code and combine it with KeyValues.
We'll probably add some code to make it easier for people to write bolts reading from the status stream.
I expect there will be more external resources (like a MetricsConsumer to send metrics directly to ElasticSearch), either in the external module or in spiderlet.