Friday, 28 November 2014

Generating a test corpus for Apache Tika from CommonCrawl : Behemoth to the rescue!

It's been a while since I last blogged, in particular about Behemoth.  For those who don't know about it, Behemoth is an open source project under Apache license which helps with large scale processing of documents by providing wrappers for various libraries (Tika, UIMA, GATE etc...), a common document representation used by these wrappers and some utility classes for manipulating the dataset generated by it. Behemoth runs on Hadoop and has been used in various projects over the years. I have started working on an equivalent for Apache Spark called Azazello (to continue with the same litterary reference) but it is still early days.

I have been using Behemoth in the last couple of days to help with TIKA-1302. What we are trying to do there is to have as large a test dataset as possible for Tika. We thought it would be interesting to use data from Common Crawl, in order to get (1) loads of it (2) things seen in the wild (3) various formats. 

Behemoth steps

Luckily Behemoth can process WARC files such as the ones generated by Common Crawl with its IO module. Assuming you have cloned the source code of Behemoth, compiled it with Maven and have Hadoop installed all you need to do is call : 

hadoop jar io/target/behemoth-io-*-SNAPSHOT-job.jar -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_KEY -D document.filter.mimetype.keep=.+[^html]  s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00099-ip-10-60-113-184.ec2.internal.warc.gz behemothCorpus

Please note the document.filter.mimetype.keep=.+[^html] parameter, this allows us to filter the input documents and keep only the ones that do not have html in their mimetype (as returned by the web servers).

The command above will generate a Hadoop sequence file containing serialized BehemothDocuments. The reader command can be used to have a peek at the content of the corpus e.g.

./behemoth reader -i behemothCorpus -m | more

contentType: image/jpeg
Date: Tue, 21 Oct 2014 08:46:31 GMT
ETag: "6afff623058a88cb23a5b18c934ee8fd19192"
Server: s23.tam
Content-Type: image/jpeg
Connection: close
Content-Length: 19192
Cache-Control: max-age=604800
X-Seen-By: s23.tam_pp


The option -m displays the metadata, we could also display the binary content if we wanted to. See WIKI for the options available.

The next step is to generate an archive with the content of each file, for which we have the generic exporter command :

./behemoth exporter -i $segName -n URL -o file:///mnt/$segName -b

This gives us a a number of archives with the content of each document put in a separate file, with the URL used for its name. We can then push the resulting archives to the machine used for testing Tika.

Scaling with Amazon EMR

The commands above will work fine even on a laptop but since we are interested in processing a substantial amount of data we need a real Hadoop cluster.

I started a smallish 5 nodes Hadoop cluster with EMR, SSHed to the master, git cloned Behemoth, compiled it and pushed the segment URLs from the latest release of Common Crawl into a SQS queue then wrote a small script which pulls the segment URLs from the queue one by one, calls the WARCConverterJob then the exporter before pushing the archives to the machine used for testing Tika. The latter step is a bit of a bottleneck as it writes to the local filesystem on the master node.

On a typical segment (like s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/segments/1413507444312.8/) we filtered 30,369,012  and kept 431,546 documents. The top mimetypes look like this : 

166208 contentType: image/jpeg
  63097 contentType: application/pdf
  58531 contentType: text/plain
  38497 contentType: image/png
  28906 contentType: text/calendar
  10162 contentType: image/gif
   7005 contentType: audio/x-wav
   6604 contentType: application/json
   3136 contentType: text/HTML
   2932 contentType: unknown/unknown
   2799 contentType: video/x-ms-asf
   2609 contentType: image/jpg
   1868 contentType: application/zip
   1798 contentType: application/msword

The regular expression we used to filter the html documents did not take the uppercase variants into account : nevermind, it still removed most of them. 

What next?

One alternative to pushing the archives to an external server would be to run the tests with Behemoth, since it has an existing wrapper for Tika. This would make the tests completely scalable and we'd also be able to use the extra information available in the BehemothDocuments such as the mime-type returned by the servers.

We'll see how this dataset gets used in TIKA-1302There are many ways in which Behemoth can be used and it has quite a few modules available. The aim of this blog was to show how easy it is to process data on a large scale with it, with or without the CommonCrawl dataset.

By the way CommonCrawl is a great resource, please support it by donating if you can (

Monday, 16 September 2013

NUTCH FIGHT! 1.7 vs 2.2.1

We've had releases in the Nutch 2.x branch for over a year now. As I described in a previous post, the main difference with the 1.x branch is the use of Apache Gora as a storage abstraction layer, which allows to use various flavours of NoSQL databases such as HBase, Cassandra or Accumulo as backends.

There seems to be a growing number of 2.x users, even though 1.x probably still holds the lead, and 2.x (and Gora) is being improved rapidly as a result. The venerable 1.x version has for it its reliability and a few more functionalities currently missing in 2.x, however how do they compare in terms of performance?


We have measured the performance of Nutch 1.7 against 2.2.1 (HBase and Cassandra) using 3 million URLs from the CommonCrawl project. These URLs were obtained using the Commoncrawl module in Behemoth.

For this test, we are less interested in fetching the entire contents of the 3M crawl database, but rather how performance varies when using common Nutch commands (inject / generate / parse / update). The fetch time is less relevant here as it is mainly network-bound and is affected less by the differences in storage between both versions.

Disclaimer : It is important to note that we are not comparing Cassandra and HBase themselves, but via  their respective Gora modules and any conclusions drawn are not necessarily applicable in general.  What we are presenting here is what a user gets when using Nutch 2.x with the default configuration for these backends. As we will see later, a lot depends also on the design of Nutch 2.x itself.


Nutch 1 version: apache-nutch-1.7
Nutch 2 version: apache-nutch-2.2.1
Cassandra version: cassandra-1.2.9
Hbase version: hbase-90.4

We used a large AWS EC2 instance (available at with 7.5 GB Memory with Hadoop 1.2.0 installed with Apache Whirr. Mapreduce has been configured to allow a maximum of 2 Mappers and 2 Reducers available.

Nutch  configuration

To make the different crawls comparable in terms of the handled urls, newly-discovered links on the webpages are not added to the crawl database, but we only fetch the ones that belong to the original 3M. Furthermore, we limit the number of urls per host to 100 and the size of the fetchlist to 5K.

These parameters are set in nutch-site.xml with the following properties: 

<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.

<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.

Note that the the number of urls per fetchlist can also be set in the crawl script, which we run from runtime/deploy/bin. We also removed the lines in the script which were related to indexing operations as these were less relevant for the comparison.


The results can be found in the table below where the values are the averages for each step over 3 runs. The average time per iteration excludes the fetching as explained above. The steps vary a bit between Nutch 1.x and 2.x (e.g. generation done in a singe step, inlinks computed as part of the update in 2.x) but 

Task on 3M urls/ 5K per iteration as listed in Nutch 1.x/
100 urls/host
Nutch 1.7
Time (min) per Tasks in Iteration
averaged over 3
Nutch 2.2.1 & Cassandra
Time (min) per Tasks in Iteration
averaged over 3
Nutch 2.2.1 &
Time (min) per Tasks in Iteration
averaged over 3

1.08 (last 2 it.)
Avg. per Iteration (min.)
Total Time (min.)

As we can see from these figures, Nutch 1 beats Nutch 2 with both Cassandra (N2C below) and HBase (N2H) on all tasks and with a considerable margin. Who is to be on second place is less clear, as we shall see when looking at the different steps involved in more depth. 

Injection is obviously fastest in N1 (17 minutes), while, N2H takes almost double the amount of time for the same task (27 minutes), but this is still exceeded by N2C, where Injection takes a staggeringly 85 minutes. 

However, it makes up for it during the iterations, where on average it is about 10 minutes faster than N2H. So if we were to run the crawl with more iterations, the longer time taken for injection (as this is only done once) would take less weight in the total. Within the iteration and except for the updating step, N2C is usually faster than N2H. 

The distribution of Mappers and Reducers for each task also stays constant over all iterations with N2C, while with N2Hbase data seems to be partitioned differently in each iteration and more Mappers are required  as the crawl goes on. This results in a longer processing time as our Hadoop setup allows only up to 2 mappers to be used at the same time. Curiously, this increase in the number of mappers was for the same number of entries as input. 

The number of mappers used by N2H and N2C is the main explanation for the differences between them. To give an example, the generation step in the first iteration took 11.6 minutes with N2C whereas N2H required 20 minutes. The latter had its input represented by 3 Mappers whereas the former required only 2 mappers. The mapping part would have certainly taken a lot less time if it had been forced into 2 mappers with a larger input (or if our cluster allowed more than 2 mappers / reducers) .

Besides the way data are stored, Nutch 1 and 2 differ by their storage strategies. Nutch 1.x has a concept of segments corresponding to a fetchlist, i.e. one round of crawling, separate from the data structure containing the status of the URLs (crawldb) whereas Nutch 2.x stores everything in a single, table-like structure. 

The implications of this is that the fetching and parsing steps of Nutch 1.x have the segments as input (i.e 5K URLs in our tests), whereas in Nutch 2.x the whole table (3M URLs) is the input. The way GORA currently operates is that all the entries are given by the backends then filtered on the client side before being submitted as input to the MapReduce Job. When the number of URLs in the table is large, a substantial amount of time is spent getting the content from the backends, filtering them as a preamble to the MapReduce job and discarding most of them in the process as they are not in the current fetchlist.

There is a JIRA issue in GORA  about filtering the content on the backend side which would certainly improve things but it does not seem to have been worked on for quite some time. 


Although more flexibility in terms of storage is attractive, at the moment this still seems to come at the price of a much lower performance compared to Nutch 1.x, which is also simpler to setup as it does not required to configure GORA and the backends (and the corresponding knowledge and skills).

This also has an impact on the hardware that can be used, as running HBase or Cassandra has an impact on the RAM required. We initially ran this test on a slightly dated laptop (3GB RAM) and could not get it to work successfully with HBase nor Cassandra. The same crawl with Nutch 1.7 ran fine.

Nutch 1.x also has the advantage of having been around for much longer and as a result is a lot more reliable. It also has some features currently missing from 2.x (e.g. pluggable indexing backends).

We can expect the performance in Nutch 2.x to improve a lot as GORA gets more features such as the one mentioned above.

We ran this test on a single server in pseudo distributed mode, but it would be interesting to see what happens on a properly distributed setup. 

Monday, 29 July 2013

Nutch training course

We are planning to run a 2-day training courses on Apache Nutch on the 24/25 October 2013. It will take place in Bristol, UK (the exact venue will be announced later). 

The course has been put on hold for now. Please do get in touch if you are interested and I will keep you updated as soon as we reach a sufficient number of attendees.

The course will cover pretty much everything about Nutch from installation and configuration to writing custom resources and will cover both Nutch 1.x and 2.x. The students will learn about best practices for running and managing a Nutch crawl. 

Attendees should have some knowledge of JAVA and be comfortable with command line tools to execute basic commands. Some understanding of Hadoop is a plus but not a strict requirement. The course will consist in some hands-on exercises : bring your laptop! Note that the demonstrations and exercises will be based on a Linux OS.

The program given here is an indication only and might change slightly. Feel free to suggest things that you'd like to learn during the course. 


  • Basic setup
  • Compilation and dependencies
  • Main concepts and operational steps
  • Nutch data structures
  • Parsing
  • Indexing
  • Scoring
  • Best practices for development and in production 


  • Plugin architecture
  • Politeness and performance
  • Metadata in Nutch
  • Advanced use cases
  • Introduction to Nutch 2.x

Please contact us on if you have a question or want to be kept informed of the next date for this course.

Wednesday, 5 June 2013

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise :

    * experience in web crawling, ideally with Apache Nutch
    * Storm, Hadoop and related technologies
    * strong Java skills
    * interest in text processing, NLP and ML
    * good social and presentation skills
    * good spoken and written English, knowledge of other languages would be a plus
    * taste for challenges and problem solving

DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering. We provide our expertise in fields such as from web crawling, NLP, ML and Search, with a focus on Open Source and Big Data.

More details on our activities can be found on our website. The position is in Bristol, UK.

This job is an opportunity to get involved in the growth of a small company, be a key player in interesting projects with our clients and work on open source software. Bristol is also a great place to live.

Please send your CV and cover letter to before the 30th June 2013.

Friday, 8 March 2013

Free your Nutch crawls with pluggable indexers

I have just committed what should be a very important new feature of the next 1.x release of Apache Nutch, namely the possibility to implement indexing backends via plugins. This is currently on the trunk only but should hopefully be ported to 2.x at some point. The Nutch-1047 JIRA issue contains a history of patches and discussions for this feature.

As you'll see by reading the explanations below, this is not the same thing as the indexing filters or the storage backends in Nutch 2.x.

Historically, Nutch used to manage its own Lucene indices itself and provide a web interface for querying them. Support for SOLR was added much later in the 1.0 release (NUTCH-442) and users had two separate commands for indexing directly with Lucene or sending the documents to SOLR, in which case the search could be done outside the Nutch search servers and directly with SOLR. We then decided to drop the Nutch search servers and the Lucene-based indexing altogether in Nutch 1.3 (NUTCH-837) and let the SOLR indexer become the only option. This was an excellent move as it greatly reduced the amount of code we had to look after and meant that we could focus on the crawling while benefiting from the advances in SOLR.

One of the nice things about Nutch is that most of its components are based on plugins. The actual plugin mechanism was borrowed from Eclipse and allows to have endpoints and extensions. Nutch has extension points for URLFilters, URLNormalizers, Parsers, Protocols, etc... The full list of Nutch extensions can be found here. Basically pretty much everything in Nutch is done via plugins and I found that most customisations of Nutch I do for my clients are usually implemented via plugins only.

As you've guessed, NUTCH-1047 is about having generic commands for indexing and handling the backend implementations via plugins. Instead of piggybacking the SOLR indexer code to send the documents to a different backend, one can now use the brand new generic IndexingJob and isolate the logic of how the documents are sent to the backend via an extension of the new IndexWriter endpoint in a custom plugin.

The IndexWriter interface is pretty straightforward :
public String describe();
public void open(JobConf job, String name) throws IOException;
public void write(NutchDocument doc) throws IOException;
public void delete(String key) throws IOException;
public void update(NutchDocument doc) throws IOException;
public void commit() throws IOException;
public void close() throws IOException;
Having this mechanism allows us to move most of the SOLR-specific code to the new indexer-solr plugin (and hopefully all of it as soon as we have a generic de-duplicator which could use the IndexWriter plugins) but more importantly will facilitate the implementation of popular indexing backends such as ElasticSearch or Amazon's CloudSearch service without making the core code of Nutch more complex. We frequently get people on the mailing list asking how to store the Nutch documents on such or such database and being able to do that in a plugin will definitely make it easier. It will also be a good way of storing Nutch documents as files, etc...

This is quite a big change to the architecture of Nutch but we tried to make it as transparent as possible for end users. The only indexer plugin currently available is a port of the existing code for SOLR and is activated by default. We left the old solr* commands and modified them so that they use the generic commands with the indexing plugins in the background so from a user point of view there should be no difference at all. 

There is already a JIRA for a text-based CSV indexing plugin and I expect that the ElasticSearch one will get rapid adoption.

I had been willing to find the time to work on this for quite some time and I'm very pleased it is now committed, thanks to the comments and reviews I got from my fellow Nutch developers. I look forward to getting more feedback and seeing it being used, extended, improved, etc... 

Wednesday, 5 September 2012

Using Behemoth on the CommonCrawl dataset

Behemoth is an open-source platform for document processing based on Hadoop which provides an excellent way to process document collections on a large scale, such as crawled pages obtained with Nutch or CommonCrawl. 

Today, we are going to use a segment of the CommonCrawl dataset and show how to import data in Behemoth, filter on some common attributes and generate vectors for clustering with Apache Mahout.


The CommonCrawl dataset  ( is an open repository of web crawl data comprising 3.8 billion documents that are universally accessible. The data is available in different formats, the most recent one separating the raw content (ARC) from the metadata in JSON and the text (HTML only).

The ARC and text formats can be handled by the CommonCrawl module in Behemoth. 
This module converts CommonCrawl data to SequenceFiles of BehemothDocuments.The difference between the documents obtained in one format or the other lies in what is added in the BehemothDocs, which is binary content for the ARC and text for the text format.

In order to access this source, you will need to get an AWS (Amazon Web Services) account, because using this data is non-free.

What you need to set up for this step:

Getting the data

Once Behemoth and its module for CommonCrawl have been installed, we can go to the command line and “cd” into the behemoth-commoncrawl folder: 

We get the data from CommonCrawl and convert it into a Behemoth corpus: 

hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.commoncrawl.CommonCrawlConverterJob2012 -D fs.s3n.awsAccessKeyId={YOUR_OWN_ID} fs.s3n.awsSecretAccessKey={YOUR_OWN_KEY}   -D document.filter.mimetype.keep=application/pdf s3n://aws-publicdatasets/common-crawl/parse-output/segment/1350433107106/* test-crawlpdf  

In this example, we filter on the mime type, since we only want to import pdf documents.
By setting the filter: -D document.filter.mimetype.keep=application/pdf, we limit what is imported from CommonCrawl. The filter takes a regular expression and will import only those documents whose mime type matches the regular expression. Note that it is possible to filter based on other things such as the URL, the length of the document or any other metadata.

In order to inspect, what has just been imported, we can now call the CorpusReader and look at the content of the Behemoth sequence file.  

hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i test-crawlpdf -c -t

The document corpus excerpt gives you some information on the source of the document, the content type and shows the first lines of the binary content (parameter -c). Note the parameter -t which displays the text for the document, however since the documents were generated from the ARC, the text has not been extracted yet. 

Output after getting the data:

: application/pdf
101 0 obj <</Linearized 1/L 527200/O 104/E 88751/N 12/T 525137/H [ 736 441]>>
101 22
0000000016 00000 n
0000001177 00000 n
0000001293 00000 n
0000001418 0

Text Extraction

To then obtain the text, we use the Tika module in Behemoth which extracts the text from the documents in a Behemoth sequence file.

hadoop jar ./behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i test-crawlpdf/* -o crawlpdf-Tika

Now, we again inspect the corpus and see the extracted text content (omitting the parameter -c)

Output after extracting the text content:

contentType: application/pdf

777 Harrah’s Rincon Way  
Valley Center, CA 92082  


Prices, hours of operation and treatments are subject to change.
Must be 21 or older to gamble. 

While Tika extracts the text content, it also generates annotations representing the original markup of a document (if present) and its metadata, which can be displayed with the parameters -m and -a.

Filtering on Language

Since, for the sake of argument,  we are only interested in the English documents in the corpus, we need to filter out all those which are in a different language. The language identification module uses the LangDetect library ( to identify and add language IDs to each document.

We identify the language with:

(1) hadoop jar ./behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -i crawlpdf-Tika -o crawlpdf-Tika-lang

From the command line output or the hadoop jobtracker, one can see the distribution of languages in a corpus: 

Then we can filter on the language ID, in this case 'en' - for  English.

After having identified the languages, the filtering can be done either by using the CorpusFilter from the core module:

hadoop jar behemoth-core*-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusFilter -D -i crawlpdf-Tika-lang -o crawlpdf-Tika-EN

Note that we could have done the same as part of the language identification step with :

(2) hadoop jar behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -D -i crawlpdf-Tika -o crawlpdf-Tika-EN

If you are only interested in filtering, the first step shown here is optional - the identification and filtering can be done in one step as shown in (2). The corresponding jobtracker output would look like this: 



Having filtered out all unwanted documents, we create the vectors representing the Behemoth documents, thanks to the resources in the Mahout module : 

hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth -i crawlpdf-Tika-EN -o crawl-pdf-vec --namedVector

Having successfully finished the preprocessing and vector generation with Behemoth, we now change to Mahout (available here) to do the clustering.

Using kmeans clustering in Mahout, there are two ways of generating the initial clusters:

1) One can specify the desired number of output clusters and the initial centroids are generated as a first step in kmeans. This will probably be best, if you do not know your data very well, but do know how many clusters you want to have in the end. 

2) Another option is to use canopy clustering, where you define a minimal distance between the centroids and the number of clusters depends on that distance and obviously also on the distance measure used.
There are ways to calculate the average distance between vectors in your corpus beforehand: 
Using more appropriate values for the distance will probably give a more representative clustering result. 

From your mahout folder: 

Thus, creating the initial centroids with canopy clustering:

mahout canopy -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/canopy-centroids -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.1 -t2 0.5 -cl

Then you call kmeans, while specifying the newly-generated canopy-centroids in the c-argument. The distance measure used here is Tanimoto, which takes into account the document length.

mahout kmeans -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/clusters -c crawl-pdf-vec/canopy-centroids/clusters-0-final  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -x 10 -cd 0.1 -cl

Since we’re interested in seeing what documents are allocated to which cluster, the ClusterDocIDDumper in the Mahout module in Behemoth comes in useful: 

hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.util.ClusterDocIDDumper -i crawl-pdf-vec/clusters/clusteredPoints  -o crawl-pdf-vec/clusterID

To extract the results to the local file system:

hadoop fs -text crawl-pdf-vec/clusterID > crawlpdf-clusterID

et voila:

….    6    37    19    23    42    43    44    10


This was merely an exercise meant to illustrate some of the capabilities of Behemoth and how it could be used to process the CommonCrawl dataset. There are  more modules  available, such as the GATE or UIMA ones that we could have used to extract named entities, or the SOLR module to index the documents. 

We actually used the CommonCrawl dataset with Behemoth for one of our clients in order to identify CVs automatically using our  text classification module alongside the Tika, GATE and Language ID modules. This was a great way of checking some of our assumptions before applying the same processes to the output of a Nutch crawl. CommonCrawl is a great resource and if you need to do some text processing on its content, it's very likely that Behemoth and that at least one of its existing modules should be useful.