DigitalPebble's Blog

Wednesday 5 June 2013

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise :

* experience in web crawling, ideally with Apache Nutch
* Storm, Hadoop and related technologies
* strong Java skills
* interest in text processing, NLP and ML
* good social and presentation skills
* good spoken and written English, knowledge of other languages would be a plus
* taste for challenges and problem solving

DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering. We provide our expertise in fields such as web crawling, NLP, ML and Search, with a focus on Open Source and Big Data.

More details on our activities can be found on our website. The position is in Bristol, UK.

This job is an opportunity to get involved in the growth of a small company, be a key player in interesting projects with our clients and work on open source software. Bristol is also a great place to live.

Please send your CV and cover letter to job@digitalpebble.com before the 30th June 2013.

Friday 8 March 2013

Free your Nutch crawls with pluggable indexers

I have just committed what should be a very important new feature of the next 1.x release of Apache Nutch, namely the possibility to implement indexing backends via plugins. This is currently on the trunk only but should hopefully be ported to 2.x at some point. The NUTCH-1047 JIRA issue contains a history of patches and discussions for this feature.

As you'll see by reading the explanations below, this is not the same thing as the indexing filters or the storage backends in Nutch 2.x.

Historically, Nutch used to manage its own Lucene indices itself and provide a web interface for querying them. Support for SOLR was added much later in the 1.0 release (NUTCH-442) and users had two separate commands for indexing directly with Lucene or sending the documents to SOLR, in which case the search could be done outside the Nutch search servers and directly with SOLR. We then decided to drop the Nutch search servers and the Lucene-based indexing altogether in Nutch 1.3 (NUTCH-837) and let the SOLR indexer become the only option. This was an excellent move as it greatly reduced the amount of code we had to look after and meant that we could focus on the crawling while benefiting from the advances in SOLR.

One of the nice things about Nutch is that most of its components are based on plugins. The actual plugin mechanism was borrowed from Eclipse and is built around extension points and extensions. Nutch has extension points for URLFilters, URLNormalizers, Parsers, Protocols, etc... The full list of Nutch extensions can be found here. Pretty much everything in Nutch is done via plugins, and I have found that most of the Nutch customisations I do for my clients are implemented via plugins only.

As you've guessed, NUTCH-1047 is about having generic commands for indexing and handling the backend implementations via plugins. Instead of piggybacking on the SOLR indexer code to send the documents to a different backend, one can now use the brand new generic IndexingJob and isolate the logic of how the documents are sent to the backend in a custom plugin which extends the new IndexWriter endpoint.

The IndexWriter interface is pretty straightforward :

public String describe();

public void open(JobConf job, String name) throws IOException;

public void write(NutchDocument doc) throws IOException;

public void delete(String key) throws IOException;

public void update(NutchDocument doc) throws IOException;

public void commit() throws IOException;

public void close() throws IOException;
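To give an idea of what a backend could look like, here is a minimal, self-contained sketch. It does not use the real Nutch classes: SimpleDoc stands in for NutchDocument, SimpleIndexWriter is a simplified copy of the interface above (open() takes only a name instead of a JobConf), and InMemoryIndexWriter is a made-up backend which buffers the operations and applies them on commit. All these names are assumptions made for the sake of illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class IndexWriterSketch {

    // Stand-in for NutchDocument (assumption): a simple bag of field name -> value.
    static class SimpleDoc {
        final Map<String, String> fields = new LinkedHashMap<>();

        SimpleDoc add(String name, String value) {
            fields.put(name, value);
            return this;
        }
    }

    // Simplified copy of the interface above; open() has no JobConf here.
    interface SimpleIndexWriter {
        String describe();
        void open(String name) throws IOException;
        void write(SimpleDoc doc) throws IOException;
        void delete(String key) throws IOException;
        void update(SimpleDoc doc) throws IOException;
        void commit() throws IOException;
        void close() throws IOException;
    }

    // Hypothetical backend: operations are buffered and applied to a map on commit.
    static class InMemoryIndexWriter implements SimpleIndexWriter {
        // A null value in the buffer marks a pending deletion.
        private final Map<String, SimpleDoc> buffer = new LinkedHashMap<>();
        final Map<String, SimpleDoc> index = new LinkedHashMap<>();
        private boolean open;

        public String describe() {
            return "InMemoryIndexWriter: keeps documents in a map keyed on the 'id' field";
        }

        public void open(String name) {
            open = true;
        }

        public void write(SimpleDoc doc) throws IOException {
            ensureOpen();
            buffer.put(doc.fields.get("id"), doc);
        }

        public void delete(String key) throws IOException {
            ensureOpen();
            buffer.put(key, null);
        }

        public void update(SimpleDoc doc) throws IOException {
            write(doc); // an update simply overwrites the previous version
        }

        public void commit() throws IOException {
            ensureOpen();
            for (Map.Entry<String, SimpleDoc> e : buffer.entrySet()) {
                if (e.getValue() == null) {
                    index.remove(e.getKey());
                } else {
                    index.put(e.getKey(), e.getValue());
                }
            }
            buffer.clear();
        }

        public void close() throws IOException {
            commit(); // flush whatever is still pending
            open = false;
        }

        private void ensureOpen() throws IOException {
            if (!open) {
                throw new IOException("writer is not open");
            }
        }
    }

    // Writes two documents, deletes one and returns the size of the index.
    static int demo() {
        InMemoryIndexWriter writer = new InMemoryIndexWriter();
        try {
            writer.open("demo");
            writer.write(new SimpleDoc().add("id", "http://a.example/").add("title", "A"));
            writer.write(new SimpleDoc().add("id", "http://b.example/").add("title", "B"));
            writer.delete("http://a.example/");
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return writer.index.size();
    }

    public static void main(String[] args) {
        System.out.println(demo() + " document(s) in the index");
    }
}
```

A real plugin would implement the Nutch interface directly and be declared as an extension in its plugin descriptor; the sketch only shows how the calls relate to each other.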

Having this mechanism allows us to move most of the SOLR-specific code to the new indexer-solr plugin (and hopefully all of it as soon as we have a generic de-duplicator which could use the IndexWriter plugins). More importantly, it will facilitate the implementation of popular indexing backends such as ElasticSearch or Amazon's CloudSearch service without making the core code of Nutch more complex. We frequently get people on the mailing list asking how to store the Nutch documents in this or that database, and being able to do that in a plugin will definitely make it easier. It will also be a good way of storing Nutch documents as files, etc...

This is quite a big change to the architecture of Nutch but we tried to make it as transparent as possible for end users. The only indexer plugin currently available is a port of the existing code for SOLR and is activated by default. We left the old solr* commands and modified them so that they use the generic commands with the indexing plugins in the background so from a user point of view there should be no difference at all.

There is already a JIRA for a text-based CSV indexing plugin and I expect that the ElasticSearch one will get rapid adoption.

I had been wanting to find the time to work on this for quite a while and I'm very pleased it is now committed, thanks to the comments and reviews I got from my fellow Nutch developers. I look forward to getting more feedback and seeing it being used, extended, improved, etc...

Wednesday 5 September 2012

Using Behemoth on the CommonCrawl dataset

Behemoth is an open-source platform for document processing based on Hadoop which provides an excellent way to process document collections on a large scale, such as crawled pages obtained with Nutch or CommonCrawl.

Today, we are going to use a segment of the CommonCrawl dataset and show how to import data in Behemoth, filter on some common attributes and generate vectors for clustering with Apache Mahout.

CommonCrawl

The CommonCrawl dataset (http://commoncrawl.org/) is an open repository of web crawl data comprising 3.8 billion documents that are universally accessible. The data is available in different formats, the most recent one separating the raw content (ARC) from the metadata in JSON and the text (HTML only).

The ARC and text formats can be handled by the CommonCrawl module in Behemoth.

This module converts CommonCrawl data to SequenceFiles of BehemothDocuments. The difference between the documents obtained from one format or the other lies in what is added to the BehemothDocuments: binary content for the ARC format and text for the text format.

In order to access this source, you will need an AWS (Amazon Web Services) account, because using this data is not free.

What you need to set up for this step:
https://github.com/DigitalPebble/behemoth
https://github.com/DigitalPebble/behemoth-commoncrawl

Getting the data

Once Behemoth and its module for CommonCrawl have been installed, we can go to the command line and “cd” into the behemoth-commoncrawl folder.

We get the data from CommonCrawl and convert it into a Behemoth corpus:

hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.commoncrawl.CommonCrawlConverterJob2012 -D fs.s3n.awsAccessKeyId={YOUR_OWN_ID} -D fs.s3n.awsSecretAccessKey={YOUR_OWN_KEY} -D document.filter.mimetype.keep=application/pdf s3n://aws-publicdatasets/common-crawl/parse-output/segment/1350433107106/* test-crawlpdf

In this example, we filter on the mime type, since we only want to import pdf documents.

By setting the filter: -D document.filter.mimetype.keep=application/pdf, we limit what is imported from CommonCrawl. The filter takes a regular expression and will import only those documents whose mime type matches the regular expression. Note that it is possible to filter based on other things such as the URL, the length of the document or any other metadata.
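To make the behaviour of such a filter more concrete, here is a minimal sketch of a regular expression applied to a mime type. This is not the Behemoth code: the class name MimeTypeFilterSketch is made up, and requiring the expression to match the whole mime type (rather than a part of it) is an assumption.

```java
import java.util.regex.Pattern;

class MimeTypeFilterSketch {

    private final Pattern keepPattern;

    // keepRegex plays the role of the value given to document.filter.mimetype.keep
    MimeTypeFilterSketch(String keepRegex) {
        this.keepPattern = Pattern.compile(keepRegex);
    }

    // Keeps a document only if its mime type is known and matches the expression.
    // Assumption: the whole mime type has to match, not just a substring.
    boolean keep(String mimeType) {
        return mimeType != null && keepPattern.matcher(mimeType).matches();
    }

    public static void main(String[] args) {
        MimeTypeFilterSketch pdfOnly = new MimeTypeFilterSketch("application/pdf");
        System.out.println(pdfOnly.keep("application/pdf")); // true
        System.out.println(pdfOnly.keep("text/html"));       // false

        // A regular expression can of course cover a family of types
        MimeTypeFilterSketch images = new MimeTypeFilterSketch("image/.*");
        System.out.println(images.keep("image/png"));        // true
    }
}
```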

In order to inspect what has just been imported, we can now call the CorpusReader and look at the content of the Behemoth sequence file.

hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i test-crawlpdf -c -t

The document corpus excerpt gives you some information on the source of the document and its content type, and shows the first lines of the binary content (parameter -c). The parameter -t displays the text of the document. However, since the documents were generated from the ARC files, the text has not been extracted yet.

Output after getting the data:

url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Content:
%PDF-1.6
%��
101 0 obj <</Linearized 1/L 527200/O 104/E 88751/N 12/T 525137/H [ 736 441]>>
endobj

xref
101 22
0000000016 00000 n
0000001177 00000 n
0000001293 00000 n
0000001418 0

Text Extraction

To then obtain the text, we use the Tika module in Behemoth which extracts the text from the documents in a Behemoth sequence file.

hadoop jar ./behemoth-tika--job.jar com.digitalpebble.behemoth.tika.TikaDriver -i test-crawlpdf/ -o crawlpdf-Tika

Now we inspect the corpus again, this time omitting the parameter -c, and see the extracted text content.

Output after extracting the text content:

url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Text:

777 Harrah’s Rincon Way
Valley Center, CA 92082

760-751-7709
www.harrahsrincon.com

Prices, hours of operation and treatments are subject to change.
Must be 21 or older to gamble.

While Tika extracts the text content, it also generates annotations representing the original markup of a document (if present) and its metadata, which can be displayed with the parameters -m and -a.

Filtering on Language

Since, for the sake of argument, we are only interested in the English documents in the corpus, we need to filter out all those which are in a different language. The language identification module uses the LangDetect library (http://code.google.com/p/language-detection/) to identify and add language IDs to each document.

We identify the language with:

(1) hadoop jar ./behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -i crawlpdf-Tika -o crawlpdf-Tika-lang

From the command line output or the Hadoop jobtracker, one can see the distribution of languages in a corpus.

Then we can filter on the language ID, in this case 'en' - for English.

Once the languages have been identified, the filtering can be done with the CorpusFilter from the core module:



hadoop jar behemoth-core*-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusFilter -D document.filter.md.keep.lang=en -i crawlpdf-Tika-lang -o crawlpdf-Tika-EN

Note that we could have done the same as part of the language identification step with :



(2) hadoop jar behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -D document.filter.md.keep.lang=en -i crawlpdf-Tika -o crawlpdf-Tika-EN

If you are only interested in filtering, the separate identification step (1) is optional: the identification and filtering can be done in one step as shown in (2). The corresponding jobtracker output would look like this:

Clustering

Having filtered out all unwanted documents, we create the vectors representing the Behemoth documents, thanks to the resources in the Mahout module :

hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth -i crawlpdf-Tika-EN -o crawl-pdf-vec --namedVector

Having successfully finished the preprocessing and vector generation with Behemoth, we now change to Mahout (available here) to do the clustering.

Using kmeans clustering in Mahout, there are two ways of generating the initial clusters:

1) One can specify the desired number of output clusters and the initial centroids are generated as a first step in kmeans. This will probably be best, if you do not know your data very well, but do know how many clusters you want to have in the end.

2) Another option is to use canopy clustering, where you define a minimal distance between the centroids and the number of clusters depends on that distance and obviously also on the distance measure used.
There are ways to calculate the average distance between vectors in your corpus beforehand:

https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html

Using more appropriate values for the distance will probably give a more representative clustering result.

From your Mahout folder, create the initial centroids with canopy clustering:

mahout canopy -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/canopy-centroids -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.1 -t2 0.5 -cl

Then you call kmeans, specifying the newly generated canopy centroids with the -c argument. The distance measure used here is Tanimoto, which takes into account the document length.

mahout kmeans -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/clusters -c crawl-pdf-vec/canopy-centroids/clusters-0-final -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -x 10 -cd 0.1 -cl
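For reference, the Tanimoto distance between two vectors a and b is 1 - a.b / (|a|² + |b|² - a.b). The sketch below is a plain re-implementation of that formula on dense arrays, written for illustration only; it is not the Mahout code and the class name is made up. The last example shows why the length matters: two vectors pointing in the same direction but with different magnitudes are still at a non-zero distance from each other, which would not be the case with a cosine distance.

```java
class TanimotoSketch {

    // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same size");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        // Two empty (all-zero) vectors are considered identical
        return denominator == 0 ? 0.0 : 1.0 - dot / denominator;
    }

    public static void main(String[] args) {
        // Identical vectors: distance 0.0
        System.out.println(distance(new double[] {1, 2}, new double[] {1, 2}));
        // No term in common: distance 1.0
        System.out.println(distance(new double[] {1, 0}, new double[] {0, 1}));
        // One shared term out of two: distance 0.5
        System.out.println(distance(new double[] {1, 1, 0}, new double[] {1, 0, 0}));
        // Same direction, different magnitude: the distance is greater than 0
        System.out.println(distance(new double[] {2, 0}, new double[] {1, 0}));
    }
}
```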

Since we’re interested in seeing what documents are allocated to which cluster, the ClusterDocIDDumper in the Mahout module in Behemoth comes in useful:

hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.util.ClusterDocIDDumper -i crawl-pdf-vec/clusters/clusteredPoints -o crawl-pdf-vec/clusterID

To extract the results to the local file system:

hadoop fs -text crawl-pdf-vec/clusterID > crawlpdf-clusterID

et voila:

….
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1505OK.pdf    6
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1511OK.pdf    37
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/OKVocablist.pdf    19
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1501OR.pdf    23
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1502OR.pdf    42
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1503OR.pdf    43
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1511OR.pdf    44
http://hdmaster.com/testing/cnatesting/tennessee/tnformpages/tnforms/1402TN.pdf    10
....

Conclusion

This was merely an exercise meant to illustrate some of the capabilities of Behemoth and how it could be used to process the CommonCrawl dataset. There are more modules available, such as the GATE or UIMA ones that we could have used to extract named entities, or the SOLR module to index the documents.

We actually used the CommonCrawl dataset with Behemoth for one of our clients in order to identify CVs automatically using our text classification module alongside the Tika, GATE and Language ID modules. This was a great way of checking some of our assumptions before applying the same processes to the output of a Nutch crawl. CommonCrawl is a great resource and if you need to do some text processing on its content, it's very likely that Behemoth and at least one of its existing modules will be useful.

Monday 9 July 2012

Nutch 2.0 is out (at last!)

Like pretty much any 2.0 release, Nutch 2.0 marks a radical change from the 1.x branch. I've mentioned 2.0 in previous posts but let's do a bit of history first. Nutch was initially started by Doug Cutting (Lucene's creator) and Mike Cafarella around 2002. Then came the MapReduce paper from Google, and in 2005 MapReduce was implemented as part of Nutch before being split out as Hadoop, a sub-project of Lucene at Apache. You know what happened to Hadoop after that: open source super-stardom, millions of dollars in investment, fierce competition between commercial distributions but also a myriad of related projects (HBase, ZooKeeper, Pig, Hive, Mahout etc...) with, in the background, the emergence of new concepts such as Big Data or NoSQL.

Meanwhile Nutch tagged along following the various releases of Hadoop but was based on the same architecture. It simply started relying on other projects more and more instead of implementing its own stuff, mainly Apache Tika (another offspring of Nutch) for parsing and extracting metadata from various document formats and Apache SOLR for indexing and searching documents. This made the code much lighter, easier to maintain and also up to date with all sorts of functionalities provided by these projects. However the way we store and access data in Nutch has remained the same since the beginning of Hadoop, i.e. SequenceFiles and MapFiles.

Nutch 2.0 (a.k.a NutchGora) started in earnest 2 years ago when one of our clients decided to invest in the development of a NoSQL-based version of Nutch. There had been a preliminary version called NutchBase developed by Dogacan Guney which was used as a basis except that instead of relying exclusively on HBase, we decided to implement our own Backend-Neutral-MapReduce-friendly-ORM which is now an Apache Top Level Project known as Apache GORA and serializes data with Apache AVRO. GORA provides us with a unified access to various backends, NoSQL or not, an object-to-datastore mapping mechanism and utilities for MapReduce. This means that Nutch 2.0 can run on HBase, Cassandra, Accumulo or MySQL with just a few configuration files to modify.

One major change in 2.0 is that, instead of keeping the status of the URLs (crawlDB) separate from the data for these URLs (content and text in segments) and from the webgraph (linkDB), we have a single table-like representation of the data where each entry contains everything we know about a URL, even the links that point to it or the various versions of its content (depending on the backend used). Not having separate segments is definitely good news. One of the side effects is that a fetch or parse step can be resumed.

From a technical point of view this means that Nutch is not limited to the sequential processing of Hadoop data structures but can operate at a more atomic level (GET, PUT). Most Nutch tasks are still MapReduce operations, but at least we can get the backends to filter the data and provide only what is needed for a specific task to the MapReduce operations.

The best example of this that I can think of is the update step in a Nutch crawl. Basically what this step does is to merge the information from a round of fetching with the rest of the CrawlDB, typically to change the status of the URLs we have fetched and add the new URLs we have discovered when parsing. With the 1.x branch this is done with a MapReduce operation which takes both the CrawlDB and the segment as input, reduces on the URLs and updates the status of the CrawlDatum objects in the reduce step. All good. Except that as the crawlDB gets larger and larger, the time taken by the update step gets longer and longer up to a point where it ends up being the slowest part of the crawl. Think about a billion entries in the crawlDB and a single URL to update and you'll get the picture.

There are ways of alleviating this for 1.x (e.g. generating multiple segments in one go and updating the crawldb with all of them at the same time) but the point is that with Nutch 2.0 the cost of the equivalent operation would be proportional to the number of URLs modified, not to the size of the whole crawl dataset.
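The difference can be illustrated with a toy example which uses plain Java maps in place of the crawlDB and of a table in the backend. This is obviously not how Nutch or Gora are implemented; the class and method names are made up and the sketch simply counts how many entries each approach has to touch in order to update a single URL.

```java
import java.util.HashMap;
import java.util.Map;

class UpdateCostSketch {

    // 1.x style: the whole crawlDB is read and rewritten, whatever the size of the update.
    static int mergeAll(Map<String, String> crawlDb, Map<String, String> fetched,
                        Map<String, String> newCrawlDb) {
        int touched = 0;
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            // keep the old status unless the URL was part of the fetch round
            newCrawlDb.put(e.getKey(), fetched.getOrDefault(e.getKey(), e.getValue()));
            touched++;
        }
        for (Map.Entry<String, String> e : fetched.entrySet()) {
            if (!crawlDb.containsKey(e.getKey())) { // newly discovered URL
                newCrawlDb.put(e.getKey(), e.getValue());
                touched++;
            }
        }
        return touched;
    }

    // 2.x style: only the modified entries are written, with PUT-like operations.
    static int putModified(Map<String, String> table, Map<String, String> fetched) {
        int touched = 0;
        for (Map.Entry<String, String> e : fetched.entrySet()) {
            table.put(e.getKey(), e.getValue());
            touched++;
        }
        return touched;
    }

    // Returns the number of entries touched by each approach for a single updated URL.
    static int[] demo() {
        Map<String, String> crawlDb = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            crawlDb.put("http://example.com/page" + i, "unfetched");
        }
        Map<String, String> fetched = new HashMap<>();
        fetched.put("http://example.com/page42", "fetched");

        int sequential = mergeAll(crawlDb, fetched, new HashMap<>());
        int keyed = putModified(new HashMap<>(crawlDb), fetched);
        return new int[] {sequential, keyed};
    }

    public static void main(String[] args) {
        int[] touched = demo();
        System.out.println("merge of the whole crawlDB: " + touched[0] + " entries touched");
        System.out.println("keyed update: " + touched[1] + " entry touched");
    }
}
```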

The change of paradigm from sequential data structures to a table-like representation is a major change for Nutch which will certainly have many positive side-effects. As this is the first release of 2.0, we can expect quite a few fixes to be needed and a massive overhaul of the documentation in the next months, but the move seems to be welcomed by the Nutch community. Of course 1.x will continue to be the trunk for as long as necessary, i.e. until 2.0 is stable and has all the functionalities that 1.x has.

BTW my slides about 2.0 from last year's Berlin Buzzwords are now here.

It is also a symbolic move: with Nutch being at the origin of many successful projects, it was about time it caught up with its famous offspring and the concepts which arose from it.

Wednesday 13 June 2012

What's new in Nutch 1.5

Apache Nutch 1.5 was released last week. As with each release, this one contains a lot of changes and I will just comment on a few of them.

The main change is actually not in the list above and has not been documented in the Wiki yet. The binary version of Nutch (apache-nutch-1.5.bin.*) now contains the local runtime only, i.e. what you get in runtime/local when compiling the sources. This should make things a bit more straightforward for beginners as we've seen quite a bit of confusion on the mailing lists about which configuration files should be modified (root/conf vs runtime/local/conf). The src version of Nutch is unchanged and is what you'll need if you want to run Nutch on an existing Hadoop cluster. Of course, the runtime/local directory will be generated from the source too and you'll be able to run Nutch in local mode as well. In a nutshell, if you are not sure about what you're doing, want to use Nutch in local mode without a Hadoop cluster and/or do not need any custom plugins then the binary version is what you're after. For production, I usually recommend using the distributed version on a pseudo-distributed Hadoop cluster, as the Hadoop web interfaces provide a wealth of useful information, not to mention of course that you can have more than one mapper or reducer and harness the full potential of your server.

Apart from the usual dependency updates (Hadoop 1.0.0, Tika 1.1), this release contains many improvements to the webgraph API, which is a better alternative to the default OPIC scoring in Nutch. In the future, it would be interesting to rely on a library such as Apache Giraph to compute the page ranks as it would simplify the code and also make it more efficient.

As mentioned in a previous post, the Nutch user and dev lists seem to indicate an increasing number of users, which is great. This also means that we tend to see the same questions and issues coming up over and over. One such question was about how to parse and index html metatags (see NUTCH-809), which I had contributed 2 years ago. The parse-metatags plugin is now available in the distribution and the steps are documented in the Wiki. Note that the parsing of the html metatags is not activated by default; this is something for the next release maybe.

An important and related change in Nutch 1.5 is NUTCH-1264 which provides a generic plugin for indexing metadata which is typically used alongside parsing plugins such as parse-metatags above and is based on configuration only. The metadata converted into fields for indexing can come from the crawldb, the parse metadata or the content metadata. More work is needed to delegate the indexing parts of existing plugins to it and this is likely to happen in the next release.

Again, Nutch 1.5 contains loads of improvements and you should definitely consider using it if you are on an older version. The next Nutch release will probably be 2.0, for which an RC is already available. Nutch 2.0, a.k.a NutchGora, is a complete redesign of Nutch based on Apache Gora and uses NoSQL datastores as backends instead of relying on the Hadoop data structures. We will have more releases from the 1.x branch as well as 2.x ones, until the latter gets stable and widely used by the community.

As usual, have a look, give it a try and contribute to Nutch if you can.

Friday 21 October 2011

Nutch hosting and monitoring

We now provide hosting and monitoring services for Apache Nutch.

For a fixed price, we will set up, run and monitor your Nutch crawler and report on its progress. The cost of the servers is included in the offer and their hardware specs are superior to what you get from Amazon EC2, without long term commitment as the service is on a monthly basis only.
The price depends on the size of the cluster as well as the complexity of the crawl.

If you use Nutch to feed documents to a search engine, we can also monitor and host your SOLR instances for you!

Monday 26 September 2011

Visualising Nutch mailing-lists traffic

The graph below shows the traffic on the Nutch user and dev mailing lists (http://mail-archives.apache.org/mod_mbox/nutch-user/ and http://mail-archives.apache.org/mod_mbox/nutch-dev/) from March 2005 to August 2011.

Traffic on Nutch mailing lists

(large size version of the graph here)

Unsurprisingly the traffic on the two lists follows similar trends with ups and downs, and the user list is generally more active than the dev list, apart from a period in 2005 (early Nutch development), a peak in July 2010 (discussions around Nutch 2.0 and refactoring of code) and the last few months. The figures for September are not complete but seem to confirm that Nutch is definitely back to a level of activity which has not been seen for the last 5 years.