DigitalPebble's Blog

Wednesday, 5 September 2012

Using Behemoth on the CommonCrawl dataset

Behemoth is an open-source platform for document processing based on Hadoop which provides an excellent way to process document collections on a large scale, such as crawled pages obtained with Nutch or CommonCrawl.

Today, we are going to use a segment of the CommonCrawl dataset and show how to import data in Behemoth, filter on some common attributes and generate vectors for clustering with Apache Mahout.

CommonCrawl

The CommonCrawl dataset (http://commoncrawl.org/) is an open repository of web crawl data comprising 3.8 billion documents that are universally accessible. The data is available in different formats, the most recent one separating the raw content (ARC) from the metadata in JSON and the text (HTML only).

The ARC and text formats can be handled by the CommonCrawl module in Behemoth.

This module converts CommonCrawl data to SequenceFiles of BehemothDocuments. The difference between the documents obtained from one format or the other lies in what is stored in the BehemothDocuments: binary content for the ARC format and text for the text format.

In order to access this source, you will need an AWS (Amazon Web Services) account, because accessing the data is not free.

What you need to set up for this step:
https://github.com/DigitalPebble/behemoth
https://github.com/DigitalPebble/behemoth-commoncrawl

Getting the data

Once Behemoth and its module for CommonCrawl have been installed, we can go to the command line and "cd" into the behemoth-commoncrawl folder.

We get the data from CommonCrawl and convert it into a Behemoth corpus:

hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.commoncrawl.CommonCrawlConverterJob2012 -D fs.s3n.awsAccessKeyId={YOUR_OWN_ID} -D fs.s3n.awsSecretAccessKey={YOUR_OWN_KEY} -D document.filter.mimetype.keep=application/pdf s3n://aws-publicdatasets/common-crawl/parse-output/segment/1350433107106/* test-crawlpdf

In this example, we filter on the mime type, since we only want to import PDF documents.

By setting the filter: -D document.filter.mimetype.keep=application/pdf, we limit what is imported from CommonCrawl. The filter takes a regular expression and will import only those documents whose mime type matches the regular expression. Note that it is possible to filter based on other things such as the URL, the length of the document or any other metadata.
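To make the idea of a regular-expression filter concrete, here is a minimal Python sketch. It is not Behemoth code: the document dictionaries and the helper keep_by_mimetype are made up for illustration, and the use of a full match (rather than a partial one) is an assumption about how such a filter might behave.

```python
import re


def keep_by_mimetype(docs, pattern):
    """Return the documents whose content type matches the regex `pattern`.

    Hypothetical helper: a full match is assumed here for illustration.
    """
    regex = re.compile(pattern)
    return [doc for doc in docs if regex.fullmatch(doc["contentType"])]


# Made-up documents, loosely mimicking the url/contentType fields shown by the CorpusReader.
docs = [
    {"url": "http://example.com/a.pdf", "contentType": "application/pdf"},
    {"url": "http://example.com/b.html", "contentType": "text/html"},
    {"url": "http://example.com/c.xml", "contentType": "application/xml"},
]

pdf_only = keep_by_mimetype(docs, r"application/pdf")
any_application = keep_by_mimetype(docs, r"application/.*")

print([d["url"] for d in pdf_only])
print([d["url"] for d in any_application])
```

A broader expression such as application/.* keeps every document whose mime type starts with application/, which is why a regular expression is more flexible than a fixed string.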

In order to inspect what has just been imported, we can now call the CorpusReader and look at the content of the Behemoth sequence file.

hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i test-crawlpdf -c -t

The document corpus excerpt gives you some information on the source of the document and the content type, and shows the first lines of the binary content (parameter -c). Note the parameter -t, which displays the text of the document. However, since the documents were generated from the ARC, the text has not been extracted yet.

Output after getting the data:

url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Content:
%PDF-1.6
%��
101 0 obj <</Linearized 1/L 527200/O 104/E 88751/N 12/T 525137/H [ 736 441]>>
endobj

xref
101 22
0000000016 00000 n
0000001177 00000 n
0000001293 00000 n
0000001418 0

Text Extraction

To then obtain the text, we use the Tika module in Behemoth which extracts the text from the documents in a Behemoth sequence file.

hadoop jar ./behemoth-tika--job.jar com.digitalpebble.behemoth.tika.TikaDriver -i test-crawlpdf/ -o crawlpdf-Tika

Now we inspect the corpus again and see the extracted text content (omitting the parameter -c).

Output after extracting the text content:

url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Text:

777 Harrah’s Rincon Way
Valley Center, CA 92082

760-751-7709
www.harrahsrincon.com

Prices, hours of operation and treatments are subject to change.
Must be 21 or older to gamble.

While Tika extracts the text content, it also generates annotations representing the original markup of a document (if present) and its metadata, which can be displayed with the parameters -m and -a.

Filtering on Language

Since, for the sake of argument, we are only interested in the English documents in the corpus, we need to filter out all those which are in a different language. The language identification module uses the LangDetect library (http://code.google.com/p/language-detection/) to identify and add language IDs to each document.

We identify the language with:

(1) hadoop jar ./behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -i crawlpdf-Tika -o crawlpdf-Tika-lang

From the command line output or the Hadoop jobtracker, one can see the distribution of languages in the corpus.

Then we can filter on the language ID, in this case 'en' - for English.

After having identified the languages, the filtering can be done with the CorpusFilter from the core module:



hadoop jar behemoth-core*-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusFilter -D document.filter.md.keep.lang=en -i crawlpdf-Tika-lang -o crawlpdf-Tika-EN

Note that we could have done the same as part of the language identification step with:



(2) hadoop jar behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -D document.filter.md.keep.lang=en -i crawlpdf-Tika -o crawlpdf-Tika-EN

If you are only interested in filtering, running the identification on its own as in (1) is optional: the identification and filtering can be done in one step as shown in (2). The corresponding jobtracker output would look like this:

Clustering

Having filtered out all unwanted documents, we create the vectors representing the Behemoth documents using the resources in the Mahout module:

hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth -i crawlpdf-Tika-EN -o crawl-pdf-vec --namedVector

Having successfully finished the preprocessing and vector generation with Behemoth, we now switch to Mahout (available here) to do the clustering.

Using kmeans clustering in Mahout, there are two ways of generating the initial clusters:

1) One can specify the desired number of output clusters, in which case the initial centroids are generated as a first step in kmeans. This will probably be best if you do not know your data very well but do know how many clusters you want to have in the end.

2) Another option is to use canopy clustering, where you define a minimal distance between the centroids. The number of clusters then depends on that distance and also on the distance measure used.
There are ways to calculate the average distance between vectors in your corpus beforehand:

https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html

Using more appropriate values for the distance will probably give a more representative clustering result.
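To illustrate why the number of clusters depends on the distance thresholds, here is a simplified Python sketch of canopy clustering. It is not Mahout's implementation: the function canopy_centers, the Euclidean distance and the sample points are assumptions chosen for illustration. The sketch follows the usual formulation, in which the loose threshold t1 is larger than the tight threshold t2.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2, distance=euclidean):
    """Simplified canopy clustering (illustrative only, not Mahout's code).

    Returns a list of (center, members). Points closer than t1 to a center
    join its canopy; points closer than t2 can no longer start a new canopy.
    """
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                # Far enough from this center to seed a canopy later on.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Made-up 2D points forming three loose groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]

wide = canopy_centers(points, t1=20.0, t2=15.0)
narrow = canopy_centers(points, t1=2.0, t2=1.0)

print(len(wide), "canopy with wide thresholds")
print(len(narrow), "canopies with narrow thresholds")
```

With wide thresholds everything collapses into a single canopy, whereas narrower ones yield one canopy per group, which is why it helps to know the typical distances in your corpus before choosing the values.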

From your Mahout folder, create the initial centroids with canopy clustering:

mahout canopy -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/canopy-centroids -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.1 -t2 0.5 -cl

Then you call kmeans, specifying the newly generated canopy centroids in the -c argument. The distance measure used here is Tanimoto, which takes into account the document length.

mahout kmeans -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/clusters -c crawl-pdf-vec/canopy-centroids/clusters-0-final -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -x 10 -cd 0.1 -cl
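As a rough illustration of the Tanimoto measure, here is a small Python sketch using the common definition 1 - (a.b) / (|a|^2 + |b|^2 - a.b). The function name and the dense list representation are assumptions made for this example, not Mahout's API.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two equal-length dense vectors.

    Illustrative sketch: 1 - (a.b) / (|a|^2 + |b|^2 - a.b).
    """
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        # Both vectors are all zeros: treat them as identical.
        return 0.0
    return 1.0 - dot / denom


identical = tanimoto_distance([1.0, 1.0, 0.0], [1.0, 1.0, 0.0])
disjoint = tanimoto_distance([1.0, 0.0], [0.0, 1.0])
# Same direction, different magnitude (think of a short and a long document).
scaled = tanimoto_distance([1.0, 1.0], [2.0, 2.0])

print(identical, disjoint, scaled)
```

The last case is the interesting one: two vectors pointing in the same direction but with different magnitudes have a non-zero Tanimoto distance, whereas a cosine-based measure would treat them as identical.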

Since we’re interested in seeing what documents are allocated to which cluster, the ClusterDocIDDumper in the Mahout module in Behemoth comes in useful:

hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.util.ClusterDocIDDumper -i crawl-pdf-vec/clusters/clusteredPoints -o crawl-pdf-vec/clusterID

To extract the results to the local file system:

hadoop fs -text crawl-pdf-vec/clusterID > crawlpdf-clusterID

Et voilà:

....
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1505OK.pdf    6
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1511OK.pdf    37
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/OKVocablist.pdf    19
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1501OR.pdf    23
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1502OR.pdf    42
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1503OR.pdf    43
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1511OR.pdf    44
http://hdmaster.com/testing/cnatesting/tennessee/tnformpages/tnforms/1402TN.pdf    10
....
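Once the file is on the local file system, a few lines of Python are enough to see which documents ended up together. The sketch below assumes that each line holds a URL followed by whitespace and a numeric cluster ID, as in the excerpt above; the function group_by_cluster is a hypothetical helper, not part of Behemoth or Mahout.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Group URLs by cluster ID from lines of the form '<url> <cluster id>'.

    Lines that do not end with a numeric ID (for example '....') are skipped.
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        url, cluster_id = parts
        clusters[int(cluster_id)].append(url)
    return dict(clusters)


# A few lines taken from the excerpt above.
sample = [
    "....",
    "http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1505OK.pdf    6",
    "http://hdmaster.com/testing/cnatesting/oregon/orformpages/1501OR.pdf    23",
    "http://hdmaster.com/testing/cnatesting/tennessee/tnformpages/tnforms/1402TN.pdf    10",
]

grouped = group_by_cluster(sample)
for cluster_id in sorted(grouped):
    print(cluster_id, grouped[cluster_id])
```

To run it on the real output, the list could be replaced with the lines read from the crawlpdf-clusterID file.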

Conclusion

This was merely an exercise meant to illustrate some of the capabilities of Behemoth and how it could be used to process the CommonCrawl dataset. There are more modules available, such as the GATE or UIMA ones that we could have used to extract named entities, or the SOLR module to index the documents.

We actually used the CommonCrawl dataset with Behemoth for one of our clients in order to identify CVs automatically using our text classification module alongside the Tika, GATE and Language ID modules. This was a great way of checking some of our assumptions before applying the same processes to the output of a Nutch crawl. CommonCrawl is a great resource and if you need to do some text processing on its content, it is very likely that Behemoth and at least one of its existing modules will be useful.

Monday, 9 July 2012

Nutch 2.0 is out (at last!)

Like pretty much any 2.0 release, Nutch 2.0 marks a radical change from the 1.x branch. I've mentioned 2.0 in previous posts but let's do a bit of history first. Nutch was started by Doug Cutting (Lucene's creator) and Mike Cafarella around 2002. Then came the MapReduce paper from Google, and in 2005 MapReduce was implemented as part of Nutch before being spun off as Hadoop, a sub-project of Lucene at Apache. You know what happened to Hadoop after that: open source super-stardom, millions of dollars in investment, fierce competition between commercial distributions and a myriad of related projects (HBase, ZooKeeper, Pig, Hive, Mahout etc.), with the emergence in the background of new concepts such as Big Data and NoSQL.

Meanwhile Nutch tagged along, following the various releases of Hadoop, but remained based on the same architecture. It simply started relying more and more on other projects instead of implementing its own stuff, mainly Apache Tika (another offspring of Nutch) for parsing and extracting metadata from various document formats and Apache SOLR for indexing and searching documents. This made the code much lighter, easier to maintain and also up to date with all sorts of functionalities provided by these projects. However, the way we store and access data in Nutch has remained the same since the beginning of Hadoop, i.e. SequenceFiles and MapFiles.

Nutch 2.0 (a.k.a. NutchGora) started in earnest 2 years ago when one of our clients decided to invest in the development of a NoSQL-based version of Nutch. A preliminary version called NutchBase, developed by Dogacan Guney, was used as a basis. However, instead of relying exclusively on HBase, we decided to implement our own backend-neutral, MapReduce-friendly ORM, which is now an Apache Top Level Project known as Apache GORA and serializes data with Apache AVRO. GORA provides us with unified access to various backends, NoSQL or not, an object-to-datastore mapping mechanism and utilities for MapReduce. This means that Nutch 2.0 can run on HBase, Cassandra, Accumulo or MySQL with just a few configuration files to modify.

One major change in 2.0 is that, instead of keeping the status of the URLs (crawlDB) separate from the data for these URLs (content and text in segments) and from the webgraph (linkDB), we have a single table-like representation of the data where each entry contains everything we know about a URL, even the links that point to it or the various versions of its content (depending on the backend used). Not having separate segments is definitely good news. One of the side effects is that a fetch or parse step can be resumed.

From a technical point of view this means that Nutch is not limited to the sequential processing of Hadoop data structures but can operate at a more atomic level (GET, PUT). Most Nutch tasks are still MapReduce operations, but at least we can get the backends to filter the data and provide only what is needed for a specific task to the MapReduce operations.

The best example of this that I can think of is the update step in a Nutch crawl. Basically what this step does is to merge the information from a round of fetching with the rest of the CrawlDB, typically to change the status of the URLs we have fetched and add the new URLs we have discovered when parsing. With the 1.x branch this is done with a MapReduce operation which takes both the CrawlDB and the segment as input, reduces on the URLs and updates the status of the CrawlDatum objects in the reduce step. All good. Except that as the crawlDB gets larger and larger, the time taken by the update step gets longer and longer up to a point where it ends up being the slowest part of the crawl. Think about a billion entries in the crawlDB and a single URL to update and you'll get the picture.

There are ways of alleviating this for 1.x (e.g. generating multiple segments in one go and updating the crawldb with all of them at the same time), but the point is that with Nutch 2.0 the cost of the equivalent operation would be proportional to the number of URLs modified, not to the size of the whole crawl dataset.
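The difference in cost can be sketched with a toy model in Python. Plain dictionaries stand in for the crawlDB and for a table-like backend; the functions update_full_scan and update_keyed are hypothetical and only count how many entries each approach has to touch. This is not Nutch code.

```python
def update_full_scan(crawldb, fetched):
    """1.x-style update: every crawlDB entry is read and rewritten.

    Returns the merged crawlDB and the number of entries touched.
    """
    merged = {}
    touched = 0
    for url, status in crawldb.items():
        merged[url] = fetched.get(url, status)
        touched += 1
    for url, status in fetched.items():
        if url not in crawldb:
            # Newly discovered URL.
            merged[url] = status
            touched += 1
    return merged, touched


def update_keyed(table, fetched):
    """2.0-style update: only the modified rows are written (PUT by key).

    Modifies `table` in place and returns the number of rows touched.
    """
    touched = 0
    for url, status in fetched.items():
        table[url] = status
        touched += 1
    return touched


# Made-up data: 1000 known URLs, a single one fetched in this round.
crawldb = {"http://example.com/%d" % i: "unfetched" for i in range(1000)}
fetched = {"http://example.com/42": "fetched"}

merged, scanned = update_full_scan(crawldb, fetched)
table = dict(crawldb)
written = update_keyed(table, fetched)

print("full scan touched", scanned, "entries")
print("keyed update touched", written, "entries")
```

Both approaches end up with the same data, but the first one does work proportional to the size of the crawlDB and the second one to the number of URLs fetched in the round.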

The shift from sequential data structures to a table-like representation is a major change for Nutch which will certainly have many positive side-effects. As this is the first release of 2.0, we can expect quite a few fixes to be needed and a massive overhaul of the documentation in the coming months, but the move seems to have been welcomed by the Nutch community. Of course 1.x will continue to be the trunk for as long as necessary, i.e. until 2.0 is stable and has all the functionalities that 1.x has.

BTW my slides about 2.0 from last year's Berlin Buzzwords are now here.

It is also a symbolic move: with Nutch being at the origin of many successful projects, it was about time it caught up with its famous offspring and the concepts which arose from it.

Wednesday, 13 June 2012

What's new in Nutch 1.5

Apache Nutch 1.5 was released last week. As with each release, this one contains a lot of changes and I will just comment on a few of them.

The main change is actually not in the list above and has not been documented in the Wiki yet. The binary version of Nutch (apache-nutch-1.5.bin.*) now contains the local runtime only, i.e. what you get in runtime/local when compiling the sources. This should make things a bit more straightforward for beginners, as we've seen quite a bit of confusion on the mailing lists about which configuration files should be modified (root/conf vs runtime/local/conf). The src version of Nutch is unchanged and is what you'll need if you want to run Nutch on an existing Hadoop cluster. Of course, the runtime/local directory will be generated from the source too and you'll be able to run Nutch in local mode as well. In a nutshell, if you are not sure about what you're doing, want to use Nutch in local mode without a Hadoop cluster and/or do not need any custom plugins, then the binary version is what you're after. I usually recommend using the distributed version on a pseudo-distributed Hadoop cluster for production, as the Hadoop web interfaces provide a wealth of useful information, not to mention that you can have more than one mapper or reducer and harness the full potential of your server.

Apart from the usual dependency updates (Hadoop 1.0.0, Tika 1.1), this release contains many improvements to the webgraph API, which is a better alternative to the default OPIC scoring in Nutch. In the future, it would be interesting to rely on a library such as Apache Giraph to compute the page ranks as it would simplify the code and also make it more efficient.

As mentioned in a previous post, the Nutch user and dev lists seem to indicate an increasing number of users, which is great. This also means that we tend to see the same questions and issues coming up over and over. One such question was about how to parse and index HTML metatags (see NUTCH-809), which I had contributed 2 years ago. The parse-metatags plugin is now available in the distribution and the steps are documented in the Wiki. Note that the parsing of the HTML metatags is not activated by default; this is something for the next release maybe.

An important and related change in Nutch 1.5 is NUTCH-1264, which provides a generic plugin for indexing metadata. It is based on configuration only and is typically used alongside parsing plugins such as parse-metatags above. The metadata converted into fields for indexing can come from the crawldb, the parse metadata or the content metadata. More work is needed to delegate the indexing parts of existing plugins to it and this is likely to happen in the next release.

Again, Nutch 1.5 contains loads of improvements and you should definitely consider using it if you are on an older version. The next Nutch release will probably be 2.0, for which an RC is already available. Nutch 2.0, a.k.a. NutchGora, is a complete redesign of Nutch based on Apache Gora and uses NoSQL datastores as backends instead of relying on the Hadoop data structures. We will have more releases from the 1.x branch as well as 2.x ones, until the latter gets stable and widely used by the community.

As usual, have a look, give it a try and contribute to Nutch if you can.

Friday, 21 October 2011

Nutch hosting and monitoring

We now provide hosting and monitoring services for Apache Nutch.

For a fixed price, we will set up, run and monitor your Nutch crawler and report on its progress. The cost of the servers is included in the offer and their hardware specs are superior to what you get from Amazon EC2, without long term commitment as the service is on a monthly basis only.
The price depends on the size of the cluster as well as the complexity of the crawl.

If you use Nutch to feed documents to a search engine, we can also monitor and host your SOLR instances for you!

Monday, 26 September 2011

Visualising Nutch mailing-lists traffic

The graph below shows the traffic on the Nutch dev and user mailing lists (http://mail-archives.apache.org/mod_mbox/nutch-user/ and http://mail-archives.apache.org/mod_mbox/nutch-dev/) from March 2005 to August 2011.

Traffic on Nutch mailing lists

(large size version of the graph here)

Unsurprisingly the traffic on the two lists follows similar trends with ups and downs and the user list globally more active than the dev list, apart from a period in 2005 (early Nutch development), a peak in July 2010 (discussions around Nutch 2.0 and refactoring of code) and the last few months. The figures for September are not complete but seem to confirm that Nutch is definitely back to a level of activity which has not been seen for the last 5 years.

Wednesday, 6 July 2011

Crawler-Commons 0.1 released

As announced on various mailing-lists:

The initial release of crawler-commons is available from: http://code.google.com/p/crawler-commons/downloads/list

The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

The current version contains resources for:
- parsing robots.txt
- parsing sitemaps
- URL analyzer which returns Top Level Domains
- a simple HttpFetcher

This release is available on Sonatype's OSS Nexus repository [https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/] and should be available on Maven Central soon.

Please send your questions, comments or suggestions to http://groups.google.com/group/crawler-commons

Doing the release was quite an interesting experience as I'd never done that before. This was the opportunity to have a closer look at ANT+Maven, how to publish artefacts, use Nexus etc., which I am sure will be useful at some point (Behemoth? GORA? Nutch?).

Now that crawler-commons is released we can start using it from Nutch and Bixo [see https://issues.apache.org/jira/browse/NUTCH-1031].

Sunday, 12 June 2011

Nutch 1.3 released + BerlinBuzzwords presentation

Nutch 1.3 has been released and contains quite a few changes, some of which have been retrofitted from Nutch 2.0 in trunk.

The main modification is that Nutch now relies entirely on SOLR for indexing and searching and we removed our indexer based on Lucene as well as the search webapps (NUTCH-837). The dependencies are managed with Apache Ivy (NUTCH-821) and we've upgraded the versions of SOLR to 3.1 and Tika to 0.9. Another important change is that we have two separate runtime environments for local and deployed configurations (NUTCH-843). Nutch 1.3 contains a lot more improvements and bugfixes so if you use Nutch you should probably migrate to it.

The presentation I gave this week at BerlinBuzzwords is now available online and covered both 1.3 and 2.0, as well as an overview of Nutch. The conference itself was great and I met quite a few Nutch users and people who planned to use it as well as Doug Cutting, the creator of Nutch himself!

There are quite a few things planned for the next release(s) and also a large amount of work to do on the documentation which is a bit dated and patchy. Luckily some new committers have recently joined the project and seem keen to help with this.