DigitalPebble's Blog

Monday, 21 February 2011

Watson, the computer Behemoth in Jeopardy!

Alex Popescu's excellent blog mentioned the DeepQA project and IBM's supercomputer Watson. Watson's recent appearance on the US TV show Jeopardy!. Interestingly, DeepQA uses both Apache Hadoop and UIMA to analyse large volumes of documents to build DeepQA's knowledge-base.

As explained in https://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf

"To preprocess the corpus and create fast run-time indices we used Hadoop. UIMA annotators were easily deployed as mappers in the Hadoop map-reduce framework. Hadoop distributes the
content over the cluster to afford high CPU utilization and provides convenient tools for deploying, managing, and monitoring the corpus analysis process."

which is exactly what Behemoth does (how very reassuring!).

The article also mentions UIMA-AS and it is not entirely clear what part of the system uses what : is UIMA-AS used for the runtime analysis of the questions and Hadoop for the background learning?

Would be interesting to know what sort of UIMA annotators were used internally for the analysis of the text and, more importantly from Behemoth's point of view, whether it could have been used for this project and/or what features would have been required to get it to work on DeepQA.

Friday, 21 January 2011

BerlinBuzzwords 2011

There is a CFP for BerlinBuzzwords 2011 which will be on 6/7 June. As the website says :

Berlin Buzzwords 2011 is a conference for developers and users of open source software projects, focussing on the issues of scalable search, data-analysis in the cloud and NoSQL-databases. Berlin Buzzwords presents more than 30 talks and presentations of international speakers specific to the three tags "search", "store" and "scale"

I presented Behemoth there last year and really enjoyed the conference. High quality talks, fantastic atmosphere and great exchanges with fellow open source committers. I really recommend it and will definitely try to go next year and probably give a short talk about Nutch 2.0, GORA or maybe give a quick update about Behemoth.

Tuesday, 14 December 2010

Module management with IVY

I've just recently some massive changes to the way we manage the code in Behemoth. Prior to that, we had a single src directory containing the various resources for using Tika, GATE, UIMA or Nutch within Behemoth. That worked fine but had a few drawbacks, mostly that we ended up with an enormous job file containing all the dependencies for all the modules. In practice most people use Behemoth with only one type of resource but not more (e.g. UIMA vs GATE).

There was also a concept of Sandbox in Behemoth which I mentioned a couple of times. The idea was to allow external contributions based on Behemoth's core and keep them separated.

Before the change, Grant Ingersoll (who has been using Behemoth to parse a large amount of documents with Tika) had made a contribution which allowed to generate a jar file for the Behemoth core classes only. In his case, he wanted to be able to play with the Behemoth output without having to deal with a mega large job file. The modularisation of the code allows to do just that but extends the principle to all the modules.

Here is how it now works. I split the code into several modules managed by Apache Ivy (by simply following the tutorials) e.g. core, uima, gate, tika, solr, etc... Most non-core modules have at least a dependency to core as well as the external jars that they require. All modules have the same ant targets and the main ant build script at the root of the project allows to resolve the dependencies, compile, test for each module. We now get separate jars file for each module (which Grant needed for the core) but also publish these jars locally via Ivy so that the other modules can rely on them.

Building a job file is done on a per-module basis, by going into a module's root directory and calling 'ant job'. The resulting job file should then contain all the dependencies for this module and can be used in Hadoop, as usual.

This new organisation of the code is definitely cleaner, leaner and easier to maintain or extend. If for instance a user want to build a process which combines the functionalities of two or more modules, it is just a matter of creating a new module with the right dependencies to the modules used (say for instance Tika + Gate + SOLR), write a custom Job and Mapreduce class and generate a job file as described above.

The concept of sandboxes is now deprecated, as they are now modules, just like everything else. The beauty being that - if the Behemoth modules are published and accessible publicly, one could simply point to them in the Ivy config of a local module and build a Behemoth application with a minimal amount of code.

Isn't that just fun!

Wednesday, 10 November 2010

Gora in incubation at Apache

Great news! GORA has been accepted in the Apache Incubator in September. It now has a brand new site, JIRA, wiki, subversion repository etc... As I explained in my very first post, GORA has been developed as a part of Nutch 2.0 to provide an abstract storage layer. Think about it as a ORM that can be plugged into a number of storage backends (Cassandra, Hbase, Mysql, etc...). What we also get from it is the ability to use these backends directly into Hadoop's MapReduce without having to write any custom code. Another way of looking at it is that it provides a simple and unified API over these various backends. This would allow for instance to develop a prototype using say, MySQL as a backend then switch to Cassandra when more scalability is needed. Since your application would be based on GORA you would not need to modify any of your code, but just the mapping schema (which is based on Apache Avro).

I was thinking about using HBase in Behemoth to avoid having multiple SequenceFiles but GORA would be a better solution as it would give us more options as to what backend to use. On top of that, we would be able to operate at an atomic level and not by batches only, i.e. process a single document from the store and put it back to the DB. Since Behemoth currently relies on the Hadoop data structures, we can only process a whole corpus and generate a new version as output, which is exactly why we wanted to have GORA in Nutch (imagine you have a 10+ billion crawlDB and add say 10M pages per fetch round - every update step in Nutch 1.x requires to read 1010M entries and write out between 1000 and 1010M; a bit wasteful isn't it? )

Assuming that we use GORA (and the AVRO schema for the Behemoth documents), we could then implement a custom Datastore in GATE to debug a Behemoth corpus or test a GATE application.

Now that GORA is in Apache-land, it will hopefully get more contributors involved and more back ends supported.

Wednesday, 27 October 2010

TextClassification plugin for GATE

Just to let you know that I've made public our TextClassification plugin for GATE. As its name indicates, it is a GATE plugin which uses
our TextClassification API for building a training corpus from
GATE documents or classify GATE docs with an existing model.

This code
has been around for some time but I did not find the time to make it
public until now. Since I regularly get emails asking me about it, I
thought it would be simpler to release it.

As usual, it is under the Apache Software License and will (maybe) get
a better documentation soon. For those of you familiar with the TC
API, this should be quite straightforward.

[1] http://github.com/jnioche/TextClassificationPlugin
[2] https://code.google.com/p/textclassification/

Monday, 27 September 2010

SimilarPages is out!

It's always nice to see clients emerging of stealth mode and showing the fruits of their labour to the public. Our friends at http://www.similarpages.com have just done so and I am doubly pleased as this also reflect the work that DigitalPebble did for them.

SimilarPages is an add-on for Firefox which allows you to discover pages which are similar to the one you are currently reading. It's pretty cool and surprisingly easy to use. It's also completely free which is always nice.

From a technical point of view, we helped SimilarPages adapting Nutch to their needs and deploying it on a 400-nodes cluster on Amazon EC2 to crawl the web. We fetched and parsed a total of more than 3 billion pages from which we obtained 200+ million lists of similarities. The crawlDB itself contained more than 10 billion URLs.

Operating at such a scale is definitely challenging and has been a great experience. From a Nutch point of view, quite a few improvements and bugfixes in Nutch 1.1 come directly from the work done at SimilarPages, so thanks for that guys!

The SimilarPages use case is actually a good example of using Nutch as a crawling platform only, i.e. not indexing the documents with Lucene or SOLR. See A. Białecki's presentation at BerlinBuzzwords 2010 for more examples on this subject.

The computation of the similarities between URLs from the Nutch crawls is made using bespoke Hadoop Map-Reduce code. More details on their approach can be found on SimilarPages' website.

I feel quite proud to have contributed to this project and wish long live to SimilarPages. If you haven't done so, give it a try!

Apache Nutch 1.2 released

[quoting the announcement by Chris Mattmann]

The Apache Nutch project is pleased to announce the release of Apache Nutch
1.2. The release contents have been pushed out to the main Apache release
site so the releases should be available as soon as the mirrors get the
syncs.

Apache Nutch, one of the six new Apache TLPs as a result of the April 2010
Board Meeting, is an extensible framework for building out large-scale
web-based search. Layered on top of fellow Apache projects Hadoop,
Lucene/Solr, and Tika, Nutch provides an out of the box platform for
fetching web pages, pdf files, word documents, and more. Nutch parses the
content and its relevant information, indexes its metadata, and makes it
available for efficient query and retrieval over modern Internet protocols.

Apache Nutch 1.2 contains a number of improvements and bug fixes. Details
can be found in the changes file:

http://www.apache.org/dist/nutch/CHANGES-1.2.txt

Apache Nutch is available in source and binary form from the following
download page: http://www.apache.org/dyn/closer.cgi/nutch/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the downloads
using signatures found on the Apache site:

http://www.apache.org/dist/nutch/KEYS-1.2.txt

For more information on Apache Nutch, visit the project home page:
http://nutch.apache.org