DigitalPebble's Blog: Generating a test corpus for Apache Tika from CommonCrawl : Behemoth to the rescue!

It's been a while since I last blogged, in particular about Behemoth. For those who don't know about it, Behemoth is an open source project under Apache license which helps with large scale processing of documents by providing wrappers for various libraries (Tika, UIMA, GATE etc...), a common document representation used by these wrappers and some utility classes for manipulating the dataset generated by it. Behemoth runs on Hadoop and has been used in various projects over the years. I have started working on an equivalent for Apache Spark called Azazello (to continue with the same litterary reference) but it is still early days.

I have been using Behemoth in the last couple of days to help with TIKA-1302. What we are trying to do there is to have as large a test dataset as possible for Tika. We thought it would be interesting to use data from Common Crawl, in order to get (1) loads of it (2) things seen in the wild (3) various formats.

Behemoth steps

Luckily Behemoth can process WARC files such as the ones generated by Common Crawl with its IO module. Assuming you have cloned the source code of Behemoth, compiled it with Maven and have Hadoop installed all you need to do is call :

hadoop jar io/target/behemoth-io-*-SNAPSHOT-job.jar com.digitalpebble.behemoth.io.warc.WARCConverterJob -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_KEY -D document.filter.mimetype.keep=.+[^html] s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00099-ip-10-60-113-184.ec2.internal.warc.gz behemothCorpus

Please note the document.filter.mimetype.keep=.+[^html] parameter, this allows us to filter the input documents and keep only the ones that do not have html in their mimetype (as returned by the web servers).

The command above will generate a Hadoop sequence file containing serialized BehemothDocuments. The reader command can be used to have a peek at the content of the corpus e.g.

./behemoth reader -i behemothCorpus -m | more

url: http://0.static.wix.com/dicons/7ffb03_2f63cf23ec1107e4ed9824f6c98e5847.wix_doc_ico

contentType: image/jpeg

metadata:

Date: Tue, 21 Oct 2014 08:46:31 GMT

ETag: "6afff623058a88cb23a5b18c934ee8fd19192"

Server: s23.tam

Content-Type: image/jpeg

Connection: close

Content-Length: 19192

Cache-Control: max-age=604800

X-Seen-By: s23.tam_pp

IP: 207.36.47.4

[...]

The option -m displays the metadata, we could also display the binary content if we wanted to. See WIKI for the options available.

The next step is to generate an archive with the content of each file, for which we have the generic exporter command :

./behemoth exporter -i $segName -n URL -o file:///mnt/$segName -b

This gives us a a number of archives with the content of each document put in a separate file, with the URL used for its name. We can then push the resulting archives to the machine used for testing Tika.

Scaling with Amazon EMR

The commands above will work fine even on a laptop but since we are interested in processing a substantial amount of data we need a real Hadoop cluster.

I started a smallish 5 nodes Hadoop cluster with EMR, SSHed to the master, git cloned Behemoth, compiled it and pushed the segment URLs from the latest release of Common Crawl into a SQS queue then wrote a small script which pulls the segment URLs from the queue one by one, calls the WARCConverterJob then the exporter before pushing the archives to the machine used for testing Tika. The latter step is a bit of a bottleneck as it writes to the local filesystem on the master node.

On a typical segment (like s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/segments/1413507444312.8/) we filtered 30,369,012 and kept 431,546 documents. The top mimetypes look like this :

166208 contentType: image/jpeg

63097 contentType: application/pdf

58531 contentType: text/plain

38497 contentType: image/png

28906 contentType: text/calendar

10162 contentType: image/gif

7005 contentType: audio/x-wav

6604 contentType: application/json

3136 contentType: text/HTML

2932 contentType: unknown/unknown

2799 contentType: video/x-ms-asf

2609 contentType: image/jpg

1868 contentType: application/zip

1798 contentType: application/msword

The regular expression we used to filter the html documents did not take the uppercase variants into account : nevermind, it still removed most of them.

What next?

One alternative to pushing the archives to an external server would be to run the tests with Behemoth, since it has an existing wrapper for Tika. This would make the tests completely scalable and we'd also be able to use the extra information available in the BehemothDocuments such as the mime-type returned by the servers.

We'll see how this dataset gets used in TIKA-1302. There are many ways in which Behemoth can be used and it has quite a few modules available. The aim of this blog was to show how easy it is to process data on a large scale with it, with or without the CommonCrawl dataset.

By the way CommonCrawl is a great resource, please support it by donating if you can (http://commoncrawl.org/donate/).

Friday, 28 November 2014

Generating a test corpus for Apache Tika from CommonCrawl : Behemoth to the rescue!

Behemoth steps

Scaling with Amazon EMR

What next?

No comments:

Post a Comment