Saturday, 7 May 2011
Nutch talk at Berlin Buzzwords 2011
I'll be giving a talk on Apache Nutch at Berlin Buzzwords.
This talk will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, Lucene, SOLR, Tika or HBase. The presentation will contain examples of real-world use cases.
The second part of the presentation will focus on the latest developments in Nutch and the changes introduced by the forthcoming version 2.0.
Labels:
berlinbuzzwords,
nutch
Tuesday, 22 March 2011
Search for US properties with SOLR and Maptimize
Our clients 5k50 have recently opened a preview of their real-estate search system, which is based on Apache SOLR and Maptimize. Maptimize is a very nice tool which manages the display of data on Google Maps by merging markers that are geographically close together.
We initially audited the existing SOLR setup, then redesigned it to add more functionality and optimise search speed. The search itself is an interesting mix of map-driven filtering with SOLR queries and faceting: any change to the map (click on a cluster, zoom in/out) is reflected in the search results and facets, and vice versa.
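As a side note on how this kind of map-driven search typically works: the current map viewport is translated into SOLR filter queries, and the facets are computed on whatever falls inside it. Below is a minimal SolrJ sketch of that idea; the SOLR URL and the field names (lat, lon, property_type, bedrooms) are made up for illustration and are not the actual schema used here.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MapSearchSketch {
    public static void main(String[] args) throws Exception {
        // hypothetical SOLR endpoint
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        // restrict results to the current map viewport (field names are made up)
        query.addFilterQuery("lat:[37.70 TO 37.83]");
        query.addFilterQuery("lon:[-122.52 TO -122.35]");
        // facets drive the filters displayed next to the map
        query.setFacet(true);
        query.addFacetField("property_type", "bedrooms");
        query.setRows(50);

        QueryResponse response = server.query(query);
        System.out.println("Found " + response.getResults().getNumFound() + " properties");
    }
}

Every time the map changes, the client simply re-issues a query like this with new bounds, which is why the results and facets always stay in sync with the map.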
Navia is a nice showcase for some of the most commonly used features of SOLR (e.g. faceting, more-like-this, autocompletion) and has a great identity thanks to its mix of geo and text search. It is currently in beta, so we can expect a few more improvements over the next few weeks.
And please feel free to give it a try so that we can get plenty of data on the performance :-)
Labels:
solr
Saturday, 19 March 2011
DigitalPebble is hiring!
We are looking for a candidate with the following skills and expertise:
- strong background in NLP and Java
- GATE: experience of writing plugins and PRs (Processing Resources), excellent knowledge of JAPE
- IE, Linked Data, Ontologies
- statistical approaches and machine learning
- large scale computing with Hadoop
- knowledge of the following technologies/tools: Lucene, SOLR, NoSQL, Tika, UIMA, Mahout
- good social and presentation skills
- good spoken and written English; knowledge of other languages would be a plus
- taste for challenges and problem solving
DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering.
More details on our activities can be found on our website. We would consider candidates working remotely with occasional travel to Bristol and to our clients in the UK and Europe. Being located in or near Bristol would be a plus.
This job is an opportunity to get involved in the growth of a small company, work on interesting projects and take part in various Apache related projects and events. Bristol is also a great place to live.
Please send your CV and cover letter before 15th April 2011 to job@digitalpebble.com
Best regards,
Julien Nioche
Monday, 21 February 2011
Watson, the computer Behemoth in Jeopardy!
Alex Popescu's excellent blog mentioned the DeepQA project and IBM's supercomputer Watson, following Watson's recent appearance on the US TV show Jeopardy!. Interestingly, DeepQA uses both Apache Hadoop and UIMA to analyse large volumes of documents and build its knowledge base.
As explained in https://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf:
"To preprocess the corpus and create fast run-time indices we used Hadoop. UIMA annotators were easily deployed as mappers in the Hadoop map-reduce framework. Hadoop distributes the content over the cluster to afford high CPU utilization and provides convenient tools for deploying, managing, and monitoring the corpus analysis process."
which is exactly what Behemoth does (how very reassuring!).
The article also mentions UIMA-AS, and it is not entirely clear which part of the system uses what: is UIMA-AS used for the runtime analysis of the questions and Hadoop for the background learning?
It would be interesting to know what sort of UIMA annotators were used internally for the analysis of the text and, more importantly from Behemoth's point of view, whether it could have been used for this project and/or what features it would have needed to work on DeepQA.
Friday, 21 January 2011
BerlinBuzzwords 2011
There is a CFP for BerlinBuzzwords 2011, which will take place on 6/7 June. As the website says:
"Berlin Buzzwords 2011 is a conference for developers and users of open source software projects, focussing on the issues of scalable search, data-analysis in the cloud and NoSQL-databases. Berlin Buzzwords presents more than 30 talks and presentations of international speakers specific to the three tags "search", "store" and "scale"."
I presented Behemoth there last year and really enjoyed the conference: high quality talks, fantastic atmosphere and great exchanges with fellow open source committers. I really recommend it and will definitely try to go next year, probably to give a short talk about Nutch 2.0 or GORA, or maybe a quick update about Behemoth.
Labels:
behemoth,
berlinbuzzwords
Tuesday, 14 December 2010
Module management with IVY
I've recently made some massive changes to the way we manage the code in Behemoth. Prior to that, we had a single src directory containing the various resources for using Tika, GATE, UIMA or Nutch within Behemoth. That worked fine but had a few drawbacks, mostly that we ended up with an enormous job file containing all the dependencies for all the modules. In practice most people use Behemoth with only one type of resource (e.g. UIMA or GATE), not all of them.
There was also a concept of Sandbox in Behemoth which I mentioned a couple of times. The idea was to allow external contributions based on Behemoth's core and keep them separated.
Before the change, Grant Ingersoll (who has been using Behemoth to parse a large number of documents with Tika) had contributed a way to generate a jar file for the Behemoth core classes only. In his case, he wanted to be able to play with the Behemoth output without having to deal with a huge job file. The modularisation of the code does just that, but extends the principle to all the modules.
Here is how it now works. I split the code into several modules managed by Apache Ivy (by simply following the tutorials), e.g. core, uima, gate, tika, solr, etc. Most non-core modules have at least a dependency on core, as well as on the external jars they require. All modules have the same ant targets, and the main ant build script at the root of the project resolves the dependencies, compiles and tests each module. We now get a separate jar file for each module (which is what Grant needed for the core) and also publish these jars locally via Ivy so that the other modules can rely on them.
Building a job file is done on a per-module basis, by going into a module's root directory and calling 'ant job'. The resulting job file should then contain all the dependencies for this module and can be used in Hadoop, as usual.
This new organisation of the code is definitely cleaner, leaner and easier to maintain or extend. If, for instance, a user wants to build a process which combines the functionality of two or more modules (say Tika + GATE + SOLR), it is just a matter of creating a new module with the right dependencies on the modules used, writing a custom Job and MapReduce class, and generating a job file as described above.
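To illustrate, here is roughly what the driver class of such a combined module could look like, using the old Hadoop mapred API. This is only a sketch: the TikaThenGateMapper class is a hypothetical mapper chaining the processing from both modules, not something that exists in Behemoth.

import com.digitalpebble.behemoth.BehemothDocument;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CombinedModuleDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), CombinedModuleDriver.class);
        job.setJobName("tika+gate");

        // a Behemoth corpus is a SequenceFile of <Text, BehemothDocument> pairs
        job.setInputFormat(SequenceFileInputFormat.class);
        job.setOutputFormat(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // hypothetical mapper chaining the Tika and GATE processing
        job.setMapperClass(TikaThenGateMapper.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BehemothDocument.class);

        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new CombinedModuleDriver(), args));
    }
}

Running 'ant job' in that module would then produce a job file bundling this class together with the jars pulled in by Ivy from the tika and gate modules.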
The concept of sandboxes is now deprecated: they are now modules, just like everything else. The beauty of it is that, if the Behemoth modules are published and publicly accessible, one can simply point to them in the Ivy configuration of a local module and build a Behemoth application with a minimal amount of code.
Isn't that just fun!
Wednesday, 10 November 2010
Gora in incubation at Apache
Great news! GORA was accepted into the Apache Incubator in September. It now has a brand new site, JIRA, wiki, subversion repository etc. As I explained in my very first post, GORA has been developed as part of Nutch 2.0 to provide an abstract storage layer. Think of it as an ORM that can be plugged into a number of storage backends (Cassandra, HBase, MySQL, etc.). What we also get from it is the ability to use these backends directly in Hadoop's MapReduce without having to write any custom code. Another way of looking at it is that it provides a simple and unified API over these various backends. This would allow you, for instance, to develop a prototype using, say, MySQL as a backend and then switch to Cassandra when more scalability is needed. Since your application would be based on GORA, you would not need to modify any of your code, just the mapping schema (which is based on Apache Avro).
I was thinking about using HBase in Behemoth to avoid having multiple SequenceFiles, but GORA would be a better solution as it would give us more options as to which backend to use. On top of that, we would be able to operate at an atomic level and not only in batches, i.e. process a single document from the store and put it back into the DB. Since Behemoth currently relies on the Hadoop data structures, we can only process a whole corpus and generate a new version as output, which is exactly why we wanted to have GORA in Nutch (imagine you have a 1+ billion crawlDB and add, say, 10M pages per fetch round: every update step in Nutch 1.x requires reading 1010M entries and writing out between 1000 and 1010M; a bit wasteful, isn't it?).
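As a rough sketch of what that could look like, assuming a hypothetical Avro-generated BehemothDocument bean and the DataStore API roughly as it stands in the GORA codebase:

import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;

public class GoraAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // the actual backend (HBase, Cassandra, MySQL...) is chosen in gora.properties,
        // so swapping one for another is purely a configuration change
        DataStore<String, BehemothDocument> store =
            DataStoreFactory.getDataStore(String.class, BehemothDocument.class, conf);

        // atomic access: fetch a single document by key, process it, write it back
        BehemothDocument doc = store.get("http://www.example.com/somedoc");
        if (doc != null) {
            // ... run the Tika/GATE/UIMA processing on this single document ...
            store.put("http://www.example.com/somedoc", doc);
        }

        store.flush();
        store.close();
    }
}

None of this code mentions a specific backend, which is exactly the point.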
Assuming that we use GORA (and the AVRO schema for the Behemoth documents), we could then implement a custom Datastore in GATE to debug a Behemoth corpus or test a GATE application.
Now that GORA is in Apache-land, it will hopefully get more contributors involved and more backends supported.