Friday 5 June 2015

What's new in Storm-Crawler 0.5

We've just released the version 0.5 of Storm-Crawler, just over three months after the previous one. As you can read below, we've been pretty busy! The project got some great contributions from new users and is seeing an increase in adoption, which is very encouraging.

Metadata and Outlinks

One of the main improvements provided in the new release is the introduction of a Metadata object which replaces the Map<String,String[]> that were used everywhere in our code as well as the KeyValues utility class which manipulated such Maps. This makes the code a lot simpler and more elegant.

A new MetadataTransfer class has been added to (a)  determine what metadata should be kept e.g. when persisting the information about a URL in a StatusUpdaterBolt but also to (b) determine what metadata should be transferred from the source document to its outlinks. This is a very useful feature that gets used quite often in practice.

Speaking of outlinks, they now have a proper class for representing them where the anchor and metadata for a given target URL are kept. Note that the parser bolts populate the metadata using the MetadataTransfer class described above prior to passing them to the ParseFilters. This means that a given ParseFilter can modify the outlinks for a given page or create completely new ones.


We got a present from our committer Gui whose company have kindly donated the a parsing bolt based on JSoup. This is now the one we use by default, the one based on Tika has been moved to the external part of the code. If you are crawling non-HTML pages then you should be using the Tika-based parser, otherwise the JSoup one is a lot lighter (both in code and dependencies) and works better when extracting data with XPath.

Abstract classes for persistence

We also added many useful resources for writing recursive crawlers, in addition to the status stream that came with the previous release. These can be found in the com.digitalpebble.storm.crawler.persistence  package. In particular, we added a new AbstractStatusUpdaterBolt class. As the name suggests, this is meant to be extended to store the tuples coming from the status stream into some sort of storage (e.g. Elasticsearch, SOLR, Cassandra, HBase etc...). The abstract class keeps an internal cache for newly discovered URLs so that the same URL does not get updated more than once in the backend. Obviously this cache would not outlive the bolt if it died so this should be seen merely as an optimization and not a 100% reliable filter. The abstract class then calls a Scheduler, which is a pluggable mechanism to define when a given URL should be fetched next based on its metadata and status. The default scheduler simply relies on the configuration set by the user.

We also added a new AbstractIndexerBolt class, to simplify writing indexing bolts by allowing the users to specify what metadata to index via the configuration. This greatly simplifies writing an IndexerBolt.


These new classes above have been used for our Elasticsearch bolts and spout. We now have :
These 3 components allow to build a recursive crawler with Elasticsearch. We also added an example topology to illustrate how to do this as well as an init script which defines the schemas of the indices.

As a bonus we wrote a MetricsConsumer which can be plugged into the Storm metrics mechanism so that they get indexed in Elasticsearch. This would be typically used by Kibana as a way of monitoring the performance of the crawler with the metrics generated by the spouts and bolts e.g. bytes per second, pages fetched etc... I had suggested it to the elasticsearch-hadoop community but it hasn't attracted much interest so far.

We will probably provide a schema file for Kibana so that users can load a standard dashboard for displaying the metrics. We just need to wait for the next release of Kibana which will contain #1552.

Miscellaneous and next steps

We've replaced the old http protocol implementation we'd borrowed from Nutch with a brand new one based on Apache HttpClient. Less code to maintain and it is also more robust, particularly on https pages.

Apart from that, we improved our WIKI pages, upgraded some dependencies (Tika to 1.8, ES to 1.5.1, Storm to 0.9.4), added some resources (e.g. MaxDepthFilter), removed some deprecated ones (#126) and fixed numerous bugs. 

As I said, we've been pretty busy and it looks like this is set to continue with the 0.6 release. It will probably contain #117 as well as resources for Apache SOLR.

Thanks to everyone who contributed to this release in any way.

Wednesday 28 January 2015

What's new in Storm-Crawler 0.4

We've recently released the version 0.4 of storm-crawler, which is a collection of resources for building low-latency, large scale web crawlers with Apache Storm

The project has been really active in the last few months, thanks partly to our 2 fantastic new committers (Jake Dodd and Gui Forget) and as a result contains some important changes and improvements.

Reorganisation of the code

We've separated the project into two separate modules named 'core' and 'external'. External  contains resources that are either specific to a given library, for instance the ElasticSearchBolt that can be used to index documents with ElasticSearch, or very generic, like our metrics related code. This simplifies the code and dependencies for the core components and makes the project easier to understand.

There are also external resources contributed by third parties, as well as a separate project (still in its infancy) which will illustrate the use of storm-crawler and provide a ready-to-use generic web crawler; whereas storm-crawler itself will remain a SDK.

We also generate a test jar and dependencies for the core module, containing code that can be reused for testing various resources.

Status stream

The main components of the SDK now send tuples not only to the standard stream but also to a separate 'status' stream, which is meant to be consumed by a bespoke bolt in charge of persisting the status and metadata for the known URLs of a crawl. This is useful for recursive crawls, where new URLs are discovered during the lifetime of the topology but also for non-recursive ones e.g. for managing redirections, errors, etc...

This is used by components such as the FetcherBolt (redirections), the ParserBolt (outlinks) or the brand new SiteMapParserBolt (outlinks - see below) , in particular to handle errors, be them temporary or not. The component in charge of storing the status of the URL can then decide when a URL should be refetched or change its status, which is a better approach than failing the URL and simplifies the code for the Spouts.

The default stream is used primarily for the main content of a URL when it is successfully fetched and parsed, typically to send it to an index on ElasticSearch or SOLR (or anything else you fancy), whereas the information of the URLs (think about the crawldb if you come from Apache Nutch) can be stored somewhere else like HBase or Cassandra.

Interface changes

We made some of the interfaces a bit richer. The Protocol interface can now receive the metadata associated with a URL. The ParseFilters can be configured with the Storm config and the URLFilter interface has access to the source URL and its metadata, which is useful for instance to filter based on the host or domain name of the source URL (see below).

New resources

Apart from the usual upgrades of dependencies, we've also added the following resources :
This release contains also several bug fixes and various other improvements.

What next?

The next release should contain the introduction of a Metadata object to replace the Map<String,String[]> that are used everywhere in our code and combine it with KeyValues.

We'll probably add some code to make it easier for people to write bolts reading from the status stream.

I expect there will be more external resources (like a MetricsConsumer to send metrics directly to ElasticSearch), either in the external module or in spiderlet.

Friday 28 November 2014

Generating a test corpus for Apache Tika from CommonCrawl : Behemoth to the rescue!

It's been a while since I last blogged, in particular about Behemoth.  For those who don't know about it, Behemoth is an open source project under Apache license which helps with large scale processing of documents by providing wrappers for various libraries (Tika, UIMA, GATE etc...), a common document representation used by these wrappers and some utility classes for manipulating the dataset generated by it. Behemoth runs on Hadoop and has been used in various projects over the years. I have started working on an equivalent for Apache Spark called Azazello (to continue with the same litterary reference) but it is still early days.

I have been using Behemoth in the last couple of days to help with TIKA-1302. What we are trying to do there is to have as large a test dataset as possible for Tika. We thought it would be interesting to use data from Common Crawl, in order to get (1) loads of it (2) things seen in the wild (3) various formats. 

Behemoth steps

Luckily Behemoth can process WARC files such as the ones generated by Common Crawl with its IO module. Assuming you have cloned the source code of Behemoth, compiled it with Maven and have Hadoop installed all you need to do is call : 

hadoop jar io/target/behemoth-io-*-SNAPSHOT-job.jar -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_KEY -D document.filter.mimetype.keep=.+[^html]  s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00099-ip-10-60-113-184.ec2.internal.warc.gz behemothCorpus

Please note the document.filter.mimetype.keep=.+[^html] parameter, this allows us to filter the input documents and keep only the ones that do not have html in their mimetype (as returned by the web servers).

The command above will generate a Hadoop sequence file containing serialized BehemothDocuments. The reader command can be used to have a peek at the content of the corpus e.g.

./behemoth reader -i behemothCorpus -m | more

contentType: image/jpeg
Date: Tue, 21 Oct 2014 08:46:31 GMT
ETag: "6afff623058a88cb23a5b18c934ee8fd19192"
Server: s23.tam
Content-Type: image/jpeg
Connection: close
Content-Length: 19192
Cache-Control: max-age=604800
X-Seen-By: s23.tam_pp


The option -m displays the metadata, we could also display the binary content if we wanted to. See WIKI for the options available.

The next step is to generate an archive with the content of each file, for which we have the generic exporter command :

./behemoth exporter -i $segName -n URL -o file:///mnt/$segName -b

This gives us a a number of archives with the content of each document put in a separate file, with the URL used for its name. We can then push the resulting archives to the machine used for testing Tika.

Scaling with Amazon EMR

The commands above will work fine even on a laptop but since we are interested in processing a substantial amount of data we need a real Hadoop cluster.

I started a smallish 5 nodes Hadoop cluster with EMR, SSHed to the master, git cloned Behemoth, compiled it and pushed the segment URLs from the latest release of Common Crawl into a SQS queue then wrote a small script which pulls the segment URLs from the queue one by one, calls the WARCConverterJob then the exporter before pushing the archives to the machine used for testing Tika. The latter step is a bit of a bottleneck as it writes to the local filesystem on the master node.

On a typical segment (like s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/segments/1413507444312.8/) we filtered 30,369,012  and kept 431,546 documents. The top mimetypes look like this : 

166208 contentType: image/jpeg
  63097 contentType: application/pdf
  58531 contentType: text/plain
  38497 contentType: image/png
  28906 contentType: text/calendar
  10162 contentType: image/gif
   7005 contentType: audio/x-wav
   6604 contentType: application/json
   3136 contentType: text/HTML
   2932 contentType: unknown/unknown
   2799 contentType: video/x-ms-asf
   2609 contentType: image/jpg
   1868 contentType: application/zip
   1798 contentType: application/msword

The regular expression we used to filter the html documents did not take the uppercase variants into account : nevermind, it still removed most of them. 

What next?

One alternative to pushing the archives to an external server would be to run the tests with Behemoth, since it has an existing wrapper for Tika. This would make the tests completely scalable and we'd also be able to use the extra information available in the BehemothDocuments such as the mime-type returned by the servers.

We'll see how this dataset gets used in TIKA-1302There are many ways in which Behemoth can be used and it has quite a few modules available. The aim of this blog was to show how easy it is to process data on a large scale with it, with or without the CommonCrawl dataset.

By the way CommonCrawl is a great resource, please support it by donating if you can (

Monday 16 September 2013

NUTCH FIGHT! 1.7 vs 2.2.1

We've had releases in the Nutch 2.x branch for over a year now. As I described in a previous post, the main difference with the 1.x branch is the use of Apache Gora as a storage abstraction layer, which allows to use various flavours of NoSQL databases such as HBase, Cassandra or Accumulo as backends.

There seems to be a growing number of 2.x users, even though 1.x probably still holds the lead, and 2.x (and Gora) is being improved rapidly as a result. The venerable 1.x version has for it its reliability and a few more functionalities currently missing in 2.x, however how do they compare in terms of performance?


We have measured the performance of Nutch 1.7 against 2.2.1 (HBase and Cassandra) using 3 million URLs from the CommonCrawl project. These URLs were obtained using the Commoncrawl module in Behemoth.

For this test, we are less interested in fetching the entire contents of the 3M crawl database, but rather how performance varies when using common Nutch commands (inject / generate / parse / update). The fetch time is less relevant here as it is mainly network-bound and is affected less by the differences in storage between both versions.

Disclaimer : It is important to note that we are not comparing Cassandra and HBase themselves, but via  their respective Gora modules and any conclusions drawn are not necessarily applicable in general.  What we are presenting here is what a user gets when using Nutch 2.x with the default configuration for these backends. As we will see later, a lot depends also on the design of Nutch 2.x itself.


Nutch 1 version: apache-nutch-1.7
Nutch 2 version: apache-nutch-2.2.1
Cassandra version: cassandra-1.2.9
Hbase version: hbase-90.4

We used a large AWS EC2 instance (available at with 7.5 GB Memory with Hadoop 1.2.0 installed with Apache Whirr. Mapreduce has been configured to allow a maximum of 2 Mappers and 2 Reducers available.

Nutch  configuration

To make the different crawls comparable in terms of the handled urls, newly-discovered links on the webpages are not added to the crawl database, but we only fetch the ones that belong to the original 3M. Furthermore, we limit the number of urls per host to 100 and the size of the fetchlist to 5K.

These parameters are set in nutch-site.xml with the following properties: 

<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.

<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.

Note that the the number of urls per fetchlist can also be set in the crawl script, which we run from runtime/deploy/bin. We also removed the lines in the script which were related to indexing operations as these were less relevant for the comparison.


The results can be found in the table below where the values are the averages for each step over 3 runs. The average time per iteration excludes the fetching as explained above. The steps vary a bit between Nutch 1.x and 2.x (e.g. generation done in a singe step, inlinks computed as part of the update in 2.x) but 

Task on 3M urls/ 5K per iteration as listed in Nutch 1.x/
100 urls/host
Nutch 1.7
Time (min) per Tasks in Iteration
averaged over 3
Nutch 2.2.1 & Cassandra
Time (min) per Tasks in Iteration
averaged over 3
Nutch 2.2.1 &
Time (min) per Tasks in Iteration
averaged over 3

1.08 (last 2 it.)
Avg. per Iteration (min.)
Total Time (min.)

As we can see from these figures, Nutch 1 beats Nutch 2 with both Cassandra (N2C below) and HBase (N2H) on all tasks and with a considerable margin. Who is to be on second place is less clear, as we shall see when looking at the different steps involved in more depth. 

Injection is obviously fastest in N1 (17 minutes), while, N2H takes almost double the amount of time for the same task (27 minutes), but this is still exceeded by N2C, where Injection takes a staggeringly 85 minutes. 

However, it makes up for it during the iterations, where on average it is about 10 minutes faster than N2H. So if we were to run the crawl with more iterations, the longer time taken for injection (as this is only done once) would take less weight in the total. Within the iteration and except for the updating step, N2C is usually faster than N2H. 

The distribution of Mappers and Reducers for each task also stays constant over all iterations with N2C, while with N2Hbase data seems to be partitioned differently in each iteration and more Mappers are required  as the crawl goes on. This results in a longer processing time as our Hadoop setup allows only up to 2 mappers to be used at the same time. Curiously, this increase in the number of mappers was for the same number of entries as input. 

The number of mappers used by N2H and N2C is the main explanation for the differences between them. To give an example, the generation step in the first iteration took 11.6 minutes with N2C whereas N2H required 20 minutes. The latter had its input represented by 3 Mappers whereas the former required only 2 mappers. The mapping part would have certainly taken a lot less time if it had been forced into 2 mappers with a larger input (or if our cluster allowed more than 2 mappers / reducers) .

Besides the way data are stored, Nutch 1 and 2 differ by their storage strategies. Nutch 1.x has a concept of segments corresponding to a fetchlist, i.e. one round of crawling, separate from the data structure containing the status of the URLs (crawldb) whereas Nutch 2.x stores everything in a single, table-like structure. 

The implications of this is that the fetching and parsing steps of Nutch 1.x have the segments as input (i.e 5K URLs in our tests), whereas in Nutch 2.x the whole table (3M URLs) is the input. The way GORA currently operates is that all the entries are given by the backends then filtered on the client side before being submitted as input to the MapReduce Job. When the number of URLs in the table is large, a substantial amount of time is spent getting the content from the backends, filtering them as a preamble to the MapReduce job and discarding most of them in the process as they are not in the current fetchlist.

There is a JIRA issue in GORA  about filtering the content on the backend side which would certainly improve things but it does not seem to have been worked on for quite some time. 


Although more flexibility in terms of storage is attractive, at the moment this still seems to come at the price of a much lower performance compared to Nutch 1.x, which is also simpler to setup as it does not required to configure GORA and the backends (and the corresponding knowledge and skills).

This also has an impact on the hardware that can be used, as running HBase or Cassandra has an impact on the RAM required. We initially ran this test on a slightly dated laptop (3GB RAM) and could not get it to work successfully with HBase nor Cassandra. The same crawl with Nutch 1.7 ran fine.

Nutch 1.x also has the advantage of having been around for much longer and as a result is a lot more reliable. It also has some features currently missing from 2.x (e.g. pluggable indexing backends).

We can expect the performance in Nutch 2.x to improve a lot as GORA gets more features such as the one mentioned above.

We ran this test on a single server in pseudo distributed mode, but it would be interesting to see what happens on a properly distributed setup. 

Monday 29 July 2013

Nutch training course

We are planning to run a 2-day training courses on Apache Nutch on the 24/25 October 2013. It will take place in Bristol, UK (the exact venue will be announced later). 

The course has been put on hold for now. Please do get in touch if you are interested and I will keep you updated as soon as we reach a sufficient number of attendees.

The course will cover pretty much everything about Nutch from installation and configuration to writing custom resources and will cover both Nutch 1.x and 2.x. The students will learn about best practices for running and managing a Nutch crawl. 

Attendees should have some knowledge of JAVA and be comfortable with command line tools to execute basic commands. Some understanding of Hadoop is a plus but not a strict requirement. The course will consist in some hands-on exercises : bring your laptop! Note that the demonstrations and exercises will be based on a Linux OS.

The program given here is an indication only and might change slightly. Feel free to suggest things that you'd like to learn during the course. 


  • Basic setup
  • Compilation and dependencies
  • Main concepts and operational steps
  • Nutch data structures
  • Parsing
  • Indexing
  • Scoring
  • Best practices for development and in production 


  • Plugin architecture
  • Politeness and performance
  • Metadata in Nutch
  • Advanced use cases
  • Introduction to Nutch 2.x

Please contact us on if you have a question or want to be kept informed of the next date for this course.

Wednesday 5 June 2013

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise :

    * experience in web crawling, ideally with Apache Nutch
    * Storm, Hadoop and related technologies
    * strong Java skills
    * interest in text processing, NLP and ML
    * good social and presentation skills
    * good spoken and written English, knowledge of other languages would be a plus
    * taste for challenges and problem solving

DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering. We provide our expertise in fields such as from web crawling, NLP, ML and Search, with a focus on Open Source and Big Data.

More details on our activities can be found on our website. The position is in Bristol, UK.

This job is an opportunity to get involved in the growth of a small company, be a key player in interesting projects with our clients and work on open source software. Bristol is also a great place to live.

Please send your CV and cover letter to before the 30th June 2013.

Friday 8 March 2013

Free your Nutch crawls with pluggable indexers

I have just committed what should be a very important new feature of the next 1.x release of Apache Nutch, namely the possibility to implement indexing backends via plugins. This is currently on the trunk only but should hopefully be ported to 2.x at some point. The Nutch-1047 JIRA issue contains a history of patches and discussions for this feature.

As you'll see by reading the explanations below, this is not the same thing as the indexing filters or the storage backends in Nutch 2.x.

Historically, Nutch used to manage its own Lucene indices itself and provide a web interface for querying them. Support for SOLR was added much later in the 1.0 release (NUTCH-442) and users had two separate commands for indexing directly with Lucene or sending the documents to SOLR, in which case the search could be done outside the Nutch search servers and directly with SOLR. We then decided to drop the Nutch search servers and the Lucene-based indexing altogether in Nutch 1.3 (NUTCH-837) and let the SOLR indexer become the only option. This was an excellent move as it greatly reduced the amount of code we had to look after and meant that we could focus on the crawling while benefiting from the advances in SOLR.

One of the nice things about Nutch is that most of its components are based on plugins. The actual plugin mechanism was borrowed from Eclipse and allows to have endpoints and extensions. Nutch has extension points for URLFilters, URLNormalizers, Parsers, Protocols, etc... The full list of Nutch extensions can be found here. Basically pretty much everything in Nutch is done via plugins and I found that most customisations of Nutch I do for my clients are usually implemented via plugins only.

As you've guessed, NUTCH-1047 is about having generic commands for indexing and handling the backend implementations via plugins. Instead of piggybacking the SOLR indexer code to send the documents to a different backend, one can now use the brand new generic IndexingJob and isolate the logic of how the documents are sent to the backend via an extension of the new IndexWriter endpoint in a custom plugin.

The IndexWriter interface is pretty straightforward :
public String describe();
public void open(JobConf job, String name) throws IOException;
public void write(NutchDocument doc) throws IOException;
public void delete(String key) throws IOException;
public void update(NutchDocument doc) throws IOException;
public void commit() throws IOException;
public void close() throws IOException;
Having this mechanism allows us to move most of the SOLR-specific code to the new indexer-solr plugin (and hopefully all of it as soon as we have a generic de-duplicator which could use the IndexWriter plugins) but more importantly will facilitate the implementation of popular indexing backends such as ElasticSearch or Amazon's CloudSearch service without making the core code of Nutch more complex. We frequently get people on the mailing list asking how to store the Nutch documents on such or such database and being able to do that in a plugin will definitely make it easier. It will also be a good way of storing Nutch documents as files, etc...

This is quite a big change to the architecture of Nutch but we tried to make it as transparent as possible for end users. The only indexer plugin currently available is a port of the existing code for SOLR and is activated by default. We left the old solr* commands and modified them so that they use the generic commands with the indexing plugins in the background so from a user point of view there should be no difference at all. 

There is already a JIRA for a text-based CSV indexing plugin and I expect that the ElasticSearch one will get rapid adoption.

I had been willing to find the time to work on this for quite some time and I'm very pleased it is now committed, thanks to the comments and reviews I got from my fellow Nutch developers. I look forward to getting more feedback and seeing it being used, extended, improved, etc...