DigitalPebble's Blog: nutch

Showing posts with label nutch. Show all posts

Friday, 4 September 2015

What's new in Storm-Crawler 0.6

We have just released version 0.6 of Storm-Crawler, an open source web crawling SDK based on Apache Storm. Storm-Crawler provides resources for building scalable, low-latency web crawlers and is used in production at various companies.

We have added loads of improvements and bug fixes since our previous release last June, thanks to the efforts of the community. The activity around the project has been very steady and a new committer (Jorge Luis Betancourt) has joined our ranks. We also had contributions from various users, which is great.

Here are the main features of version 0.6.

Dependencies upgrades

Storm 0.9.5
crawler-commons 0.6
Tika 1.10

Code reorganisation

Organise external content as separate sub-modules #145
Removed external/metrics #160

API changes

ParseFilter from interface to abstract class #159

Parse can output more than one document #135

New features and resources

SimpleFetcherBolt enforces politeness #181
New RobotsURLFilter #178
New ContentFilter to restrict text of document to XPath match #150
Adding support for using the canonical URL in the IndexerBolts #161
Improvement to SitemapParserBolt #143
Enforce robots meta instructions #148
Expand XPathFilter to accept a list of expressions as an argument #153
JSoupParserBolt does a basic check of the content type #151

External resources

The external (non-core) resources have been separated into discrete sub-modules as their number was getting larger.

SOLR

Our brand new module for Apache SOLR (see #152) is comparable to the existing ElasticSearch equivalent and provides an IndexerBolt, a MetricsConsumer and a SOLRSpout and StatusUpdaterBolt.

SQL

Not all web crawls require scalable big data solutions. I conducted a survey of Apache Nutch users some time ago which showed that most people used it on a single machine and less than a million URL. These are often people crawling a single website. With that in mind, we added a spout and StatusUpdaterBolt implementations to use MySQL as a storage for URL status which is useful for small recursive crawls. See #172 for details.

AWS CloudSearch

There is also a new AWS module containing an IndexerBolt for Amazon CloudSearch (see #174).

We hope that people find these improvements useful and would like to thank all users and contributors.

Monday, 16 September 2013

NUTCH FIGHT! 1.7 vs 2.2.1

We've had releases in the Nutch 2.x branch for over a year now. As I described in a previous post, the main difference with the 1.x branch is the use of Apache Gora as a storage abstraction layer, which allows to use various flavours of NoSQL databases such as HBase, Cassandra or Accumulo as backends.

There seems to be a growing number of 2.x users, even though 1.x probably still holds the lead, and 2.x (and Gora) is being improved rapidly as a result. The venerable 1.x version has for it its reliability and a few more functionalities currently missing in 2.x, however how do they compare in terms of performance?

Procedure

We have measured the performance of Nutch 1.7 against 2.2.1 (HBase and Cassandra) using 3 million URLs from the CommonCrawl project. These URLs were obtained using the Commoncrawl module in Behemoth.

For this test, we are less interested in fetching the entire contents of the 3M crawl database, but rather how performance varies when using common Nutch commands (inject / generate / parse / update). The fetch time is less relevant here as it is mainly network-bound and is affected less by the differences in storage between both versions.

Disclaimer : It is important to note that we are not comparing Cassandra and HBase themselves, but via their respective Gora modules and any conclusions drawn are not necessarily applicable in general. What we are presenting here is what a user gets when using Nutch 2.x with the default configuration for these backends. As we will see later, a lot depends also on the design of Nutch 2.x itself.

Setup

Nutch 1 version: apache-nutch-1.7

Nutch 2 version: apache-nutch-2.2.1

Cassandra version: cassandra-1.2.9

Hbase version: hbase-90.4

We used a large AWS EC2 instance (available at http://aws.amazon.com/ec2/) with 7.5 GB Memory with Hadoop 1.2.0 installed with Apache Whirr. Mapreduce has been configured to allow a maximum of 2 Mappers and 2 Reducers available.

Nutch configuration

To make the different crawls comparable in terms of the handled urls, newly-discovered links on the webpages are not added to the crawl database, but we only fetch the ones that belong to the original 3M. Furthermore, we limit the number of urls per host to 100 and the size of the fetchlist to 5K.

These parameters are set in nutch-site.xml with the following properties:

<property>

<name>db.update.additions.allowed</name>

<value>false</value>

<description>If true, updatedb will add newly discovered URLs, if false

only already existing URLs in the CrawlDb will be updated and no new

URLs will be added.

</description>

</property>

<property>

<name>generate.max.count</name>

<value>100</value>

<description>The maximum number of urls in a single

fetchlist. -1 if unlimited. The urls are counted according

to the value of the parameter generator.count.mode.

</description>

</property>

Note that the the number of urls per fetchlist can also be set in the crawl script, which we run from runtime/deploy/bin. We also removed the lines in the script which were related to indexing operations as these were less relevant for the comparison.

Results

The results can be found in the table below where the values are the averages for each step over 3 runs. The average time per iteration excludes the fetching as explained above. The steps vary a bit between Nutch 1.x and 2.x (e.g. generation done in a singe step, inlinks computed as part of the update in 2.x) but

Task on 3M urls/ 5K per iteration as listed in Nutch 1.x/ 100 urls/host	Nutch 1.7 Time (min) per Tasks in Iteration averaged over 3 runs	Nutch 2.2.1 & Cassandra Time (min) per Tasks in Iteration averaged over 3 runs	Nutch 2.2.1 & Hbase Time (min) per Tasks in Iteration averaged over 3 runs
inject	15.17	85.41	27.25
crawldb	2.20	-	-
Iteration:
generate:select	5.11	11.8	18.54
generate:partition	0.33	-	-
fetch	12.9	34.0	38.9
parse	2.0	7.6	13.64
crawldb:update	3.14	26.3	18.27
linkdb	0.41	-	-
linkdb-merge	1.08 (last 2 it.)	-	-
Avg. per Iteration (min.)	12	45	50
Total Time (min.)	29	130	78

As we can see from these figures, Nutch 1 beats Nutch 2 with both Cassandra (N2C below) and HBase (N2H) on all tasks and with a considerable margin. Who is to be on second place is less clear, as we shall see when looking at the different steps involved in more depth.

Injection is obviously fastest in N1 (17 minutes), while, N2H takes almost double the amount of time for the same task (27 minutes), but this is still exceeded by N2C, where Injection takes a staggeringly 85 minutes.

However, it makes up for it during the iterations, where on average it is about 10 minutes faster than N2H. So if we were to run the crawl with more iterations, the longer time taken for injection (as this is only done once) would take less weight in the total. Within the iteration and except for the updating step, N2C is usually faster than N2H.

The distribution of Mappers and Reducers for each task also stays constant over all iterations with N2C, while with N2Hbase data seems to be partitioned differently in each iteration and more Mappers are required as the crawl goes on. This results in a longer processing time as our Hadoop setup allows only up to 2 mappers to be used at the same time. Curiously, this increase in the number of mappers was for the same number of entries as input.

The number of mappers used by N2H and N2C is the main explanation for the differences between them. To give an example, the generation step in the first iteration took 11.6 minutes with N2C whereas N2H required 20 minutes. The latter had its input represented by 3 Mappers whereas the former required only 2 mappers. The mapping part would have certainly taken a lot less time if it had been forced into 2 mappers with a larger input (or if our cluster allowed more than 2 mappers / reducers) .

Besides the way data are stored, Nutch 1 and 2 differ by their storage strategies. Nutch 1.x has a concept of segments corresponding to a fetchlist, i.e. one round of crawling, separate from the data structure containing the status of the URLs (crawldb) whereas Nutch 2.x stores everything in a single, table-like structure.

The implications of this is that the fetching and parsing steps of Nutch 1.x have the segments as input (i.e 5K URLs in our tests), whereas in Nutch 2.x the whole table (3M URLs) is the input. The way GORA currently operates is that all the entries are given by the backends then filtered on the client side before being submitted as input to the MapReduce Job. When the number of URLs in the table is large, a substantial amount of time is spent getting the content from the backends, filtering them as a preamble to the MapReduce job and discarding most of them in the process as they are not in the current fetchlist.

There is a JIRA issue in GORA about filtering the content on the backend side which would certainly improve things but it does not seem to have been worked on for quite some time.

Conclusions

Although more flexibility in terms of storage is attractive, at the moment this still seems to come at the price of a much lower performance compared to Nutch 1.x, which is also simpler to setup as it does not required to configure GORA and the backends (and the corresponding knowledge and skills).

This also has an impact on the hardware that can be used, as running HBase or Cassandra has an impact on the RAM required. We initially ran this test on a slightly dated laptop (3GB RAM) and could not get it to work successfully with HBase nor Cassandra. The same crawl with Nutch 1.7 ran fine.

Nutch 1.x also has the advantage of having been around for much longer and as a result is a lot more reliable. It also has some features currently missing from 2.x (e.g. pluggable indexing backends).

We can expect the performance in Nutch 2.x to improve a lot as GORA gets more features such as the one mentioned above.

We ran this test on a single server in pseudo distributed mode, but it would be interesting to see what happens on a properly distributed setup.

Monday, 29 July 2013

Nutch training course

~~We are planning to run a 2-day training courses on Apache Nutch on the 24/25 October 2013. It will take place in Bristol, UK (the exact venue will be announced later).~~

The course has been put on hold for now. Please do get in touch if you are interested and I will keep you updated as soon as we reach a sufficient number of attendees.

The course will cover pretty much everything about Nutch from installation and configuration to writing custom resources and will cover both Nutch 1.x and 2.x. The students will learn about best practices for running and managing a Nutch crawl.

Attendees should have some knowledge of JAVA and be comfortable with command line tools to execute basic commands. Some understanding of Hadoop is a plus but not a strict requirement. The course will consist in some hands-on exercises : bring your laptop! Note that the demonstrations and exercises will be based on a Linux OS.

The program given here is an indication only and might change slightly. Feel free to suggest things that you'd like to learn during the course.

Day 1 : NUTCH BASICS

Basic setup
Compilation and dependencies
Main concepts and operational steps
Nutch data structures
Parsing
Indexing
Scoring
Best practices for development and in production

Day 2 : ADVANCED NUTCH

Plugin architecture
Politeness and performance
Metadata in Nutch
Advanced use cases
Introduction to Nutch 2.x

Please contact us on course@digitalpebble.com if you have a question or want to be kept informed of the next date for this course.

Friday, 8 March 2013

Free your Nutch crawls with pluggable indexers

I have just committed what should be a very important new feature of the next 1.x release of Apache Nutch, namely the possibility to implement indexing backends via plugins. This is currently on the trunk only but should hopefully be ported to 2.x at some point. The Nutch-1047 JIRA issue contains a history of patches and discussions for this feature.

As you'll see by reading the explanations below, this is not the same thing as the indexing filters or the storage backends in Nutch 2.x.

Historically, Nutch used to manage its own Lucene indices itself and provide a web interface for querying them. Support for SOLR was added much later in the 1.0 release (NUTCH-442) and users had two separate commands for indexing directly with Lucene or sending the documents to SOLR, in which case the search could be done outside the Nutch search servers and directly with SOLR. We then decided to drop the Nutch search servers and the Lucene-based indexing altogether in Nutch 1.3 (NUTCH-837) and let the SOLR indexer become the only option. This was an excellent move as it greatly reduced the amount of code we had to look after and meant that we could focus on the crawling while benefiting from the advances in SOLR.

One of the nice things about Nutch is that most of its components are based on plugins. The actual plugin mechanism was borrowed from Eclipse and allows to have endpoints and extensions. Nutch has extension points for URLFilters, URLNormalizers, Parsers, Protocols, etc... The full list of Nutch extensions can be found here. Basically pretty much everything in Nutch is done via plugins and I found that most customisations of Nutch I do for my clients are usually implemented via plugins only.

As you've guessed, NUTCH-1047 is about having generic commands for indexing and handling the backend implementations via plugins. Instead of piggybacking the SOLR indexer code to send the documents to a different backend, one can now use the brand new generic IndexingJob and isolate the logic of how the documents are sent to the backend via an extension of the new IndexWriter endpoint in a custom plugin.

The IndexWriter interface is pretty straightforward :

public String describe();

public void open(JobConf job, String name) throws IOException;

public void write(NutchDocument doc) throws IOException;

public void delete(String key) throws IOException;

public void update(NutchDocument doc) throws IOException;

public void commit() throws IOException;

public void close() throws IOException;

Having this mechanism allows us to move most of the SOLR-specific code to the new indexer-solr plugin (and hopefully all of it as soon as we have a generic de-duplicator which could use the IndexWriter plugins) but more importantly will facilitate the implementation of popular indexing backends such as ElasticSearch or Amazon's CloudSearch service without making the core code of Nutch more complex. We frequently get people on the mailing list asking how to store the Nutch documents on such or such database and being able to do that in a plugin will definitely make it easier. It will also be a good way of storing Nutch documents as files, etc...

This is quite a big change to the architecture of Nutch but we tried to make it as transparent as possible for end users. The only indexer plugin currently available is a port of the existing code for SOLR and is activated by default. We left the old solr* commands and modified them so that they use the generic commands with the indexing plugins in the background so from a user point of view there should be no difference at all.

There is already a JIRA for a text-based CSV indexing plugin and I expect that the ElasticSearch one will get rapid adoption.

I had been willing to find the time to work on this for quite some time and I'm very pleased it is now committed, thanks to the comments and reviews I got from my fellow Nutch developers. I look forward to getting more feedback and seeing it being used, extended, improved, etc...

Monday, 9 July 2012

Nutch 2.0 is out (at last!)

Like pretty much any 2.0 release, Nutch 2.0 marks a radical change from the 1.x branch. I've mentioned 2.0 in previous posts but let's do a bit of history first. Nutch was initially started by Doug Cutting (Lucene's creator) and Mike Caffarella around 2002, then came the MapReduce paper from Google and in 2005 MapReduce was implemented as part of Nutch then became a sub project of Lucene at Apache. You know what happened to Hadoop after that : open source super-stardom, millions of dollars in investment, fierce competition between commercial distributions but also a myriad of related projects (HBase, ZooKeeper, Pig, Hive, Mahout etc...) with in the background the emergence of new concepts such as Big Data or NoSQL.

Meanwhile Nutch tagged along following the various releases of Hadoop but was based on the same architecture. It simply started relying on other projects more and more instead of implementing its own stuff, mainly Apache Tika (another offspring of Nutch) for parsing and extracting metadata from various document formats and Apache SOLR for indexing and searching documents. This made the code much lighter, easier to maintain and also up to date with all sorts of functionalities provided by these projects. However the way we stored and access data in Nutch remained the same since the beginning of Hadoop i.e. SequenceFiles and MapFiles.

Nutch 2.0 (a.k.a NutchGora) started in earnest 2 years ago when one of our clients decided to invest in the development of a NoSQL-based version of Nutch. There had been a preliminary version called NutchBase developed by Dogacan Guney which was used as a basis except that instead of relying exclusively on HBase, we decided to implement our own Backend-Neutral-MapReduce-friendly-ORM which is now an Apache Top Level Project known as Apache GORA and serializes data with Apache AVRO. GORA provides us with a unified access to various backends, NoSQL or not, an object-to-datastore mapping mechanism and utilities for MapReduce. This means that Nutch 2.0 can run on HBase, Cassandra, Accumulo or MySQL with just a few configuration files to modify.

One major change in 2.0 is that, instead of having a separation between the status of the URLs (crawlDB) separated from the data for these URLs (content and text in segments) and the webgraph (linkDB), we have a single table-like representation of the data where each entry contains everything we know about a URL, even the links that point to it or the various versions of its content (depending on the backend used). Not having separate segments is definitely good news. One of the side effects is that a fetch or parse step can be resumed.

From a technical point of view this means that Nutch is not limited to the sequential processing of Hadoop data structures but can operate at a more atomic level (GET, PUT). Most Nutch tasks are still MapReduce operations though but at least we can get the backends to filter the data and provide only what is needed for a specific task to the MapReduce operations.

The best example of this that I can think of is the update step in a Nutch crawl. Basically what this step does is to merge the information from a round of fetching with the rest of the CrawlDB, typically to change the status of the URLs we have fetched and add the new URLs we have discovered when parsing. With the 1.x branch this is done with a MapReduce operation which takes both the CrawlDB and the segment as input, reduces on the URLs and updates the status of the CrawlDatum objects in the reduce step. All good. Except that as the crawlDB gets larger and larger, the time taken by the update step gets longer and longer up to a point where it ends up being the slowest part of the crawl. Think about a billion entries in the crawlDB and a single URL to update and you'll get the picture.

There are ways of alleviating this for 1.x (i.e. generate multiple segments in one go and update them with the crawldb at the same time) but the point is that with Nutch 2.0 the equivalent operation would be linear with the number of URLs modified, not the whole crawl dataset.

The change of paradigm between sequential datastructures to a table-like representation is a major change for Nutch which will certainly have many positive side-effects. Being the first release of 2.0, we can expect quite a few fixes to be needed and a massive overhaul of the documentation in the next months but the move seems to be positively welcomed by the Nutch community. Of course 1.x will continue to be the trunk for as long as necessary, i.e. until 2.0 is stable and has all the functionalities that 1.x has.

BTW my slides about 2.0 from last year's Berlin Buzzword are now here.

It is also a symbolic move, with Nutch being at the origin of many successful projects, it was about time it caught up with its famous offspring and the concepts which arose from it.

Wednesday, 13 June 2012

What's new in Nutch 1.5

Apache Nutch 1.5 has been released last week. As with each release, this one contains a lot of changes and I will just comment on a few of them.

The main change is actually not in the list above and has not been documented in the Wiki yet. The binary version of Nutch (apache-nutch-1.5.bin.*) now contains the local runtime only, i.e. what you get in runtime/local when compiling the sources. This should make things a bit more straightforward for beginners as we've seen quite a bit of confusion on the mailing lists about which configuration files should be modified (root/conf vs runtime/local/conf). The src version of Nutch is unchanged and is what you'll need if you want to run Nutch on an existing Hadoop cluster. Of course, the runtime/local directory will be generated too from the source and you'll be able to run Nutch in local mode as well. In a nutshell, if you are not sure about what you're doing, want to use Nutch in local mode without a Hadoop cluster and/or do not need any custom plugins then the binary version is what you're after. I usually recommend to use the distributed version on a pseudo-distributed Hadoop cluster for production as the Hadoop web interfaces provide a wealth of useful information, not mentioning of course that you can have more than one mapper or reducer and harness the full potential of your server.

Apart from the usual dependency updates (Hadoop 1.0.0, Tika 1.1), this release contains many improvements to the webgraph API, which is a better alternative than the default OPIC scoring in Nutch. In the future, it would be interesting to rely on a library such as Apache Giraph to compute the page ranks as it would simplify the code and also make it more efficient.

As mentioned in a previous post, the Nutch user and dev lists seem to indicate an increasing number of users, which is great. This also mean that we tend to see the same questions and issues coming over and over. One such question was about how to parse and index html metatags (see NUTCH-809) which I had contributed 2 years ago. The parse-metatags plugin is now available in the distribution and the steps are documented in the Wiki. Note that the parsing of the html metatags is not activated by default, this is something for the next release maybe.

An important and related change in Nutch 1.5 is NUTCH-1264 which provides a generic plugin for indexing metadata which is typically used alongside parsing plugins such as parse-metatags above and is based on configuration only. The metadata converted into fields for indexing can come from the crawldb, the parse metadata or the content metadata. More work is needed to delegate the indexing parts of existing plugins to it and this is likely to happen in the next release.

Again, Nutch 1.5 contains loads of improvements and you should definitely consider using it if you are on an older version. The next Nutch release will probably be 2.0 for which a RC is already available. Nutch 2.0, a.k.a NutchGora, is a complete redesign of Nutch based on Apache Gora and uses NoSQL datastores as backends instead of relying on the Hadoop data structures. We will have more releases from the 1.x branch as well as 2.x ones, until the latter gets stable and widely used by the community.

As usual, have a look, give it a try and contribute to Nutch if you can.

Friday, 21 October 2011

Nutch hosting and monitoring

We now provide hosting and monitoring services for Apache Nutch.

For a fixed price, we will set up, run and monitor your Nutch crawler and report on its progress. The cost of the servers is included in the offer and their hardware specs are superior to what you get from Amazon EC2, without long term commitment as the service is on a monthly basis only.
The price depends on the size of the cluster as well as the complexity of the crawl.

If you use Nutch to feed documents to a seach engine, we can also monitor and host your SOLR instances for you!

Wednesday, 6 July 2011

Crawler-Commons 0.1 released

As announced on various mailing-lists :

The initial release of crawler-commons is available from : http://code.google.com/p/crawler-commons/downloads/list

The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

The current version contains resources for :
- parsing robots.txt
- parsing sitemaps
- URL analyzer which returns Top Level Domains
- a simple HttpFetcher

This release is available on Sonatype's OSS Nexus repository [https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/] and should be available on Maven Central soon.

Please send your questions, comments or suggestions to http://groups.google.com/group/crawler-commons

Doing the release was quite an interesting experience as I'd never done that before. This was the opportunity to have a closer look at ANT+Maven, how to publish artefacts and use Nexus etc... which I am sure will be useful at some point (Behemoth? GORA? Nutch?).

Now that crawler-commons is released we can start using it from Nutch, Bixo [see https://issues.apache.org/jira/browse/NUTCH-1031].

Sunday, 12 June 2011

Nutch 1.3 released + BerlinBuzzwords presentation

Nutch 1.3 has been released and contains quite a few changes, some of which have been retrofitted from Nutch 2.0 in trunk.

The main modification is that Nutch now relies entirely on SOLR for indexing and searching and we removed our indexer based on Lucene as well as the search webapps (NUTCH-837). The dependencies are managed with Apache Ivy (NUTCH-821) and we've upgraded the versions of SOLR to 3.1 and Tika to 0.9. Another important change is that we have two separate runtime environments for local and deployed configurations (NUTCH-843). Nutch 1.3 contains a lot more improvements and bugfixes so if you use Nutch you should probably migrate to it.

The presentation I gave this week at BerlinBuzzwords is now available online and covered both 1.3 and 2.0, as well as an overview of Nutch. The conference itself was great and I met quite a few Nutch users and people who planned to use it as well as Doug Cutting, the creator of Nutch himself!

There are quite a few things planned for the next release(s) and also a large amount of work to do on the documentation which is a bit dated and patchy. Luckily some new committers have recently joined the project and seem keen to help with this.

Saturday, 7 May 2011

Nutch talk at Berlin Buzzwords 2011

I'll be giving a talk on Apache Nutch at Berlin Buzzwords.

This talk will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, Lucene, SOLR, Tika or HBase. The presentation will contain examples of real-case uses.

The second part of the presentation will be focused on the latest developments in Nutch and the changed introduces by the forthcoming version 2.0.

Saturday, 19 March 2011

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise :

strong background in NLP and Java
GATE, experience of writing plugins and PRs, excellent knowledge of JAPE
IE, Linked Data, Ontologies
statistical approaches and machine learning
large scale computing with Hadoop
knowledge of the following technologies / tools : Lucene, SOLR, NoSQL, Tika, UIMA, Mahout
good social and presentation skills
good spoken and written English, knowledge of other languages would be a plus
taste for challenges and problem solving

    DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering.

    More details on our activities can be found on our website. We would consider candidates working remotely with occasional travel to Bristol and our clients in UK and Europe. Being located in or near Bristol would be a plus.

    This job is an opportunity to get involved in the growth of a small company, work on interesting projects and take part in various Apache related projects and events. Bristol is also a great place to live.

   Please send your CV and cover letter before the 15th April 2011 to job@digitalpebble.com

    Best regards,

    Julien Nioche

Wednesday, 10 November 2010

Gora in incubation at Apache

Great news! GORA has been accepted in the Apache Incubator in September. It now has a brand new site, JIRA, wiki, subversion repository etc... As I explained in my very first post, GORA has been developed as a part of Nutch 2.0 to provide an abstract storage layer. Think about it as a ORM that can be plugged into a number of storage backends (Cassandra, Hbase, Mysql, etc...). What we also get from it is the ability to use these backends directly into Hadoop's MapReduce without having to write any custom code. Another way of looking at it is that it provides a simple and unified API over these various backends. This would allow for instance to develop a prototype using say, MySQL as a backend then switch to Cassandra when more scalability is needed. Since your application would be based on GORA you would not need to modify any of your code, but just the mapping schema (which is based on Apache Avro).

I was thinking about using HBase in Behemoth to avoid having multiple SequenceFiles but GORA would be a better solution as it would give us more options as to what backend to use. On top of that, we would be able to operate at an atomic level and not by batches only, i.e. process a single document from the store and put it back to the DB. Since Behemoth currently relies on the Hadoop data structures, we can only process a whole corpus and generate a new version as output, which is exactly why we wanted to have GORA in Nutch (imagine you have a 10+ billion crawlDB and add say 10M pages per fetch round - every update step in Nutch 1.x requires to read 1010M entries and write out between 1000 and 1010M; a bit wasteful isn't it? )

Assuming that we use GORA (and the AVRO schema for the Behemoth documents), we could then implement a custom Datastore in GATE to debug a Behemoth corpus or test a GATE application.

Now that GORA is in Apache-land, it will hopefully get more contributors involved and more back ends supported.

Monday, 27 September 2010

SimilarPages is out!

It's always nice to see clients emerging of stealth mode and showing the fruits of their labour to the public. Our friends at http://www.similarpages.com have just done so and I am doubly pleased as this also reflect the work that DigitalPebble did for them.

SimilarPages is an add-on for Firefox which allows you to discover pages which are similar to the one you are currently reading. It's pretty cool and surprisingly easy to use. It's also completely free which is always nice.

From a technical point of view, we helped SimilarPages adapting Nutch to their needs and deploying it on a 400-nodes cluster on Amazon EC2 to crawl the web. We fetched and parsed a total of more than 3 billion pages from which we obtained 200+ million lists of similarities. The crawlDB itself contained more than 10 billion URLs.

Operating at such a scale is definitely challenging and has been a great experience. From a Nutch point of view, quite a few improvements and bugfixes in Nutch 1.1 come directly from the work done at SimilarPages, so thanks for that guys!

The SimilarPages use case is actually a good example of using Nutch as a crawling platform only, i.e. not indexing the documents with Lucene or SOLR. See A. Białecki's presentation at BerlinBuzzwords 2010 for more examples on this subject.

The computation of the similarities between URLs from the Nutch crawls is made using bespoke Hadoop Map-Reduce code. More details on their approach can be found on SimilarPages' website.

I feel quite proud to have contributed to this project and wish long live to SimilarPages. If you haven't done so, give it a try!

Apache Nutch 1.2 released

[quoting the announcement by Chris Mattmann]

The Apache Nutch project is pleased to announce the release of Apache Nutch
1.2. The release contents have been pushed out to the main Apache release
site so the releases should be available as soon as the mirrors get the
syncs.

Apache Nutch, one of the six new Apache TLPs as a result of the April 2010
Board Meeting, is an extensible framework for building out large-scale
web-based search. Layered on top of fellow Apache projects Hadoop,
Lucene/Solr, and Tika, Nutch provides an out of the box platform for
fetching web pages, pdf files, word documents, and more. Nutch parses the
content and its relevant information, indexes its metadata, and makes it
available for efficient query and retrieval over modern Internet protocols.

Apache Nutch 1.2 contains a number of improvements and bug fixes. Details
can be found in the changes file:

http://www.apache.org/dist/nutch/CHANGES-1.2.txt

Apache Nutch is available in source and binary form from the following
download page: http://www.apache.org/dyn/closer.cgi/nutch/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the downloads
using signatures found on the Apache site:

http://www.apache.org/dist/nutch/KEYS-1.2.txt

For more information on Apache Nutch, visit the project home page:
http://nutch.apache.org

Friday, 13 August 2010

Towards Nutch 2.0

Nevermind the dodgy look of the blog - I'll improve that later!

For my first post, I'd like to mention the progress we've made recently towards Apache Nutch 2.0. It is based on a branch named NutchBase which has been developed mainly by Doğacan Güney and it is now in the trunk of the SVN repository. One of the main aspects of Nutch 2.0 is that it is now storing its data in a datastore and not in Hadoop's file-based structures. Note that we still have the distribution and replication of the data over a whole cluster and data locality for MapReduce but we also have the possibility to insert or modify a random entry in the table without having to read/write the whole data structure as it was the case before.

Nutch uses a project named GORA as an intermediate between our code and the backend storage. There would be a lot of things to say on GORA but to make it short what we are trying to achieve with it is to make it a sort of common API for NoSQL stores. GORA already has implementations for HBase and Cassandra but also SQL. The plan for GORA is to put it in the Apache Incubator or possibly as an Apache subproject (Hadoop? HBase? Cassandra?). We'll see how it goes.

There are quite a few structural changes in Nutch, most notably the fact that there aren't any segments any more as all the information about a URL (metadata, original content, extracted text, ...) in stored in a single table which means for instance no more segments to merge or metadata to move back to the crawldb. It's all in one place!

There are other substantial changes in 2.0, notably the removal of the Lucene-based indexing and search as we now rely on SOLR. Other indexing backends might be added later. Another step towards delegating functionalities to external projects is the increased used of Apache Tika for the parsing. We've removed quite a few legacy parsers from Nutch and let Tika do the work for us. We've also revamped the organisation of the code and did a lot of code clean up.

Nutch 2.0 is still at an early stage and we are actively working on it, testing, debugging etc... The good news is that it is not only an architectural change but also a basis for a whole lot of new functionalities (see for instance https://issues.apache.org/jira/browse/NUTCH-882).

I'll keep you posted on our progress, as usual : give it a try, get involved, join us...