DigitalPebble's Blog: gate


Saturday, 19 March 2011

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise:

strong background in NLP and Java
GATE: experience of writing plugins and PRs (Processing Resources), excellent knowledge of JAPE
information extraction (IE), Linked Data, ontologies
statistical approaches and machine learning
large-scale computing with Hadoop
knowledge of the following technologies and tools: Lucene, Solr, NoSQL, Tika, UIMA, Mahout
good social and presentation skills
good spoken and written English; knowledge of other languages would be a plus
a taste for challenges and problem solving

DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering.

More details on our activities can be found on our website. We would consider candidates working remotely with occasional travel to Bristol and to our clients in the UK and Europe. Being located in or near Bristol would be a plus.

This job is an opportunity to get involved in the growth of a small company, work on interesting projects and take part in various Apache-related projects and events. Bristol is also a great place to live.

Please send your CV and cover letter before 15 April 2011 to job@digitalpebble.com

Best regards,

Julien Nioche

Wednesday, 10 November 2010

Gora in incubation at Apache

Great news! GORA was accepted into the Apache Incubator in September. It now has a brand new site, JIRA, wiki, Subversion repository, etc. As I explained in my very first post, GORA was developed as part of Nutch 2.0 to provide an abstract storage layer. Think of it as an ORM that can be plugged into a number of storage backends (Cassandra, HBase, MySQL, etc.). What we also get from it is the ability to use these backends directly in Hadoop's MapReduce without having to write any custom code. Another way of looking at it is that it provides a simple, unified API over these various backends. This would allow you, for instance, to develop a prototype using, say, MySQL as a backend and then switch to Cassandra when more scalability is needed. Since your application would be based on GORA, you would not need to modify any of your code, only the mapping schema (which is based on Apache Avro).
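To make the "unified API over several backends" idea concrete, here is a minimal sketch in Python. It is not GORA's actual API: the `DataStore` interface, the two toy backends (a dict and SQLite, standing in for a prototype store and a relational store) and the `mark_fetched` function are all hypothetical names invented for illustration. The point is only that the application code is written once against the interface and the backend is chosen by configuration.

```python
import json
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class DataStore(ABC):
    """Hypothetical backend-agnostic interface (not GORA's real API)."""

    @abstractmethod
    def put(self, key: str, record: dict) -> None:
        """Store or overwrite one record."""

    @abstractmethod
    def get(self, key: str) -> Optional[dict]:
        """Return the record for key, or None if absent."""


class MemoryStore(DataStore):
    """Toy backend: a plain dict."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, key: str, record: dict) -> None:
        self._rows[key] = dict(record)

    def get(self, key: str) -> Optional[dict]:
        row = self._rows.get(key)
        return dict(row) if row is not None else None


class SqliteStore(DataStore):
    """Toy backend: SQLite, with records serialised as JSON text."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS docs (key TEXT PRIMARY KEY, body TEXT)"
        )

    def put(self, key: str, record: dict) -> None:
        self._db.execute(
            "INSERT OR REPLACE INTO docs (key, body) VALUES (?, ?)",
            (key, json.dumps(record)),
        )
        self._db.commit()

    def get(self, key: str) -> Optional[dict]:
        row = self._db.execute(
            "SELECT body FROM docs WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None


# Only this configuration value changes when the backend is swapped.
BACKENDS = {"memory": MemoryStore, "sqlite": SqliteStore}


def open_store(backend: str) -> DataStore:
    """Instantiate the backend registered under the given name."""
    return BACKENDS[backend]()


def mark_fetched(store: DataStore, url: str) -> dict:
    """Application code: reads, updates and writes back a single record."""
    page = store.get(url) or {"url": url, "fetches": 0}
    page["fetches"] += 1
    store.put(url, page)
    return page
```

Calling `mark_fetched(open_store("memory"), url)` and `mark_fetched(open_store("sqlite"), url)` runs exactly the same application code against two different stores.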

I was thinking about using HBase in Behemoth to avoid having multiple SequenceFiles, but GORA would be a better solution as it would give us more options as to which backend to use. On top of that, we would be able to operate at an atomic level and not only in batches, i.e. process a single document from the store and put it back into the DB. Since Behemoth currently relies on the Hadoop data structures, we can only process a whole corpus and generate a new version as output. That limitation is exactly why we wanted to have GORA in Nutch: imagine you have a crawlDB of 10+ billion entries and add, say, 10M pages per fetch round. Every update step in Nutch 1.x then requires reading 10,010M entries and writing out between 10,000M and 10,010M. A bit wasteful, isn't it?
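As a back-of-the-envelope check of that cost, the sketch below plugs in the figures from the paragraph above (a crawlDB of 10 billion entries and 10M pages per fetch round). The function names are made up for illustration, and the in-place model (one read and one write per touched entry) is a simplifying assumption, not a measurement of Nutch or GORA.

```python
from typing import Tuple


def batch_round_io(store_entries: int, new_entries: int) -> Tuple[int, int, int]:
    """I/O for one update round when the whole store is rewritten.

    Returns (entries read, minimum entries written, maximum entries written).
    """
    entries_read = store_entries + new_entries
    # Minimum: every new entry was already known, so the store does not grow.
    min_written = store_entries
    # Maximum: every new entry is unseen, so the store grows by all of them.
    max_written = store_entries + new_entries
    return entries_read, min_written, max_written


def in_place_round_io(new_entries: int) -> Tuple[int, int]:
    """I/O for one round when each entry is updated individually.

    Assumption: one read and one write per touched entry.
    """
    return new_entries, new_entries


CRAWLDB = 10_000_000_000  # 10 billion entries
FETCHED = 10_000_000      # 10M pages per fetch round

batch_read, batch_min, batch_max = batch_round_io(CRAWLDB, FETCHED)
single_read, single_written = in_place_round_io(FETCHED)

# How many times more entries the batch round reads than the in-place round.
read_ratio = batch_read // single_read
print(batch_read, batch_min, batch_max, read_ratio)
```

Under these assumptions the batch round reads roughly a thousand times more entries than the number it actually needs to touch.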

Assuming that we use GORA (and the Avro schema for the Behemoth documents), we could then implement a custom DataStore in GATE to debug a Behemoth corpus or test a GATE application.

Now that GORA is in Apache-land, it will hopefully get more contributors involved and more back ends supported.

Wednesday, 27 October 2010

TextClassification plugin for GATE

Just to let you know that I've made our TextClassification plugin for GATE public [1]. As its name indicates, it is a GATE plugin which uses our TextClassification API [2] to build a training corpus from GATE documents or to classify GATE documents with an existing model.

This code has been around for some time, but I did not find the time to make it public until now. Since I regularly get emails asking me about it, I thought it would be simpler to release it.

As usual, it is under the Apache Software License and will (maybe) get better documentation soon. For those of you familiar with the TC API, this should be quite straightforward.

[1] http://github.com/jnioche/TextClassificationPlugin
[2] https://code.google.com/p/textclassification/

Saturday, 28 August 2010

Behemoth talk from BerlinBuzzwords 2010

The talk I gave on Behemoth at BerlinBuzzwords has been filmed (I do not dare watch it) and is available at http://blip.tv/file/3809855.

The slides can be found at http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp

The talk contains a quick demo of GATE and mentions Tika, UIMA and of course Hadoop.