Tuesday 10 January 2017

What's new in StormCrawler 1.3


StormCrawler 1.3 has just been released! As usual, all users are advised to upgrade to this version, as it fixes a number of bugs and brings quite a few new features as well as performance improvements.

Dependencies upgrades

  • Jsoup 1.10.1
  • Crawler-Commons 0.7
  • RomeTools 1.7.0
  • ICU4J 58.2

Core module

  • Hardcoded limit to the max # connections allowed by protocol #388
  • LangID module #364
  • JsoupParserBolt can use first N bytes for charset detection (or not at all) #391
  • SimpleFetcherBolt uses allowRedir from super class #394 (bugfix)
  • URLNormalizer : Decode non-standard percent encoding prior to re-encoding
  • MaxDepthFilter defaults to -1, 0 removes all outlinks, and a custom max depth can be set per URL with max.depth. Implements #399 and #400

The latter breaks compatibility with previous versions: 0 used to deactivate filtering by depth, whereas it now prevents any outlinks from being processed. Please set the value to -1 in your configuration if you want to deactivate depth filtering, as illustrated below.
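
As an illustration, the corresponding entry in urlfilters.json could look like the sketch below; the class path and the maxDepth parameter name are assumptions based on the project's usual naming, so do check the resources shipped with 1.3:

{
  "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
  "name": "MaxDepthFilter",
  "params": {
    "maxDepth": -1
  }
}

With -1 (the new default) no filtering by depth is done, 0 removes all outlinks, and any positive value caps the crawl depth.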

Elasticsearch

  • Flux for crawl and injection topologies #372
  • Use min delay for all types of Spouts #370
  • Remove Node client #377
  • ESSpout deals with deep paging before building query
  • Topology status updater triaged by URL to hit cache
  • Settings done via configuration #376
  • Add plugin to the clients via configuration #378
  • Spouts: load results with a non-blocking call #371
  • Concurrent requests in config #382
  • StatusUpdaterBolt - do not add URL already in buffer for ES if status is DISCOVERED
  • Allow fieldNameForRoutingKey to be outside metadata and use a different key for spouts #384
  • Use SHA256 as doc_id #385
  • Separate Kibana schema for status and metrics + put all schemas in a separate folder
  • Improvements to ES_IndexInit
  • ES crawl topology uses FetcherBolt
Please note that the cluster name is now defined alongside the other settings:
  es.status.settings:
    cluster.name: "elasticsearch"
One of the benefits of #376 and #378 is that you can now use StormCrawler with an Elastic Cloud cluster protected with Shield, for instance with settings along the lines of the sketch below.
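
As a rough sketch, the credentials could then be passed alongside the other settings. Note that shield.user (in "username:password" form) is a Shield transport client setting rather than something specific to StormCrawler, and that the Shield plugin itself would need to be added to the client via the mechanism introduced in #378; treat the snippet below as an assumption to adapt to your own setup:

  es.status.settings:
    cluster.name: "your-cluster-name"
    shield.user: "username:password"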

We are fast approaching our 1,000th commit! Thanks to all users and contributors for their help with StormCrawler. Happy crawling!

PS: I will be running a 1-day workshop in Berlin on the 2nd of February. Announcements will be made on our Twitter account.


Tuesday 3 January 2017

The Battle of the Crawlers : Apache Nutch vs StormCrawler

Happy New Year everyone!

For this first blog post of 2017, we'll compare the performance of StormCrawler and Apache Nutch. As you probably know, these are open source solutions for distributed web crawling; last year we gave an overview of both as well as a performance comparison when crawling a single website.

StormCrawler has been steadily gaining in popularity over the last 18 months and a frequent question asked by prospective users is how fast it is compared to Nutch. Last year's blog post provided some insights into this, but now we'll go one step further by crawling not a single website but a thousand. The benchmark will still be on a single server, though it will cover several million pages.

Disclaimer: I am a committer on Apache Nutch and the author of StormCrawler.

Meet the contestants

Please have a look at our previous blog post for a more detailed description of both projects. This Q and A should also be useful. 

Apache Nutch is a well-established web crawler based on Apache Hadoop. As such, it operates in batches, with the various aspects of web crawling done as separate steps (e.g. generate a list of URLs to fetch, fetch them, parse the web pages and update its data structures).

In this benchmark, we'll use the 1.x version of Nutch. There is a 2.x branch but as we saw in a previous benchmark, it is a lot slower. It also lacks some of the functionalities of 1.x and is not actively maintained.

StormCrawler, on the other hand, is based on Apache Storm, a distributed stream processing platform. All the web crawling operations are done continuously and at the same time.

What we can assume (and observed previously) is that StormCrawler should be more efficient, as Nutch does not fetch web pages continuously but only as one of its various batch steps. On top of that, some of its operations - mainly the ones that deal with the crawldb, the data structure used by Nutch - take longer and longer as the size of the crawl grows.


The battleground

We ran the benchmark on a dedicated server provided by OVH with the following specs :

Intel Xeon E5-1630v3, 4c/8t, 3.7/3.8 GHz
64 GB of DDR4 ECC RAM at 2133 MHz
2 x 480 GB SSD in RAID 0

Ubuntu 16.10 server

We installed the following software

Apache Storm 1.0.2
Elasticsearch 2.4
Kibana 4.6.3

StormCrawler 1.3-SNAPSHOT

Hadoop 2.7.3
Apache Nutch 1.13-SNAPSHOT

Finally, the resources and configurations for the benchmark can be found in the first release of https://github.com/DigitalPebble/stormcrawlerfight.

We followed the recommendations for configuring Elasticsearch on SSDs and gave it 10 GB of RAM to run on.

Apache Nutch 

The configuration for Nutch can be found in the GitHub repo under the nutch directory. This should allow you to reproduce the benchmark if you wish to do so.

The main changes to the crawl script, apart from the addition of a contribution I recently made to Nutch, were to :

  • set the number of fetch threads to 500
  • change the max size of the fetchlist to 50,000,000
  • use 4 reducer tasks
  • remove the link inversion and dedup steps

The latter was done in order to keep the crawl workflow to a minimum. We left the fetch time limit set to 3 hours; the aim of this was to avoid long tails in the fetching step, where the process is busy fetching from only a handful of slow servers.

In order to optimise the crawl, we limited the number of URLs per hostname in the fetchlist to 100, which guarantees a good distribution of URLs and again, prevents the long tail phenomenon, which is commonly observed with Nutch. We also tried to avoid the conundrum whereby setting too low a duration for the fetching step requires more crawl iterations, meaning that more generate and update steps are necessary.


Note: we initially intended to index the documents into Elasticsearch; however, this step proved unreliable and caused errors with Hadoop. We ended up deactivating the indexing step in the script, which should benefit Nutch in the comparison with StormCrawler.

We ran 10 crawl iterations between 2016.12.16 11:18:37 CET and 2016.12.17 19:29:30 CET; the breakdown of times per step is as follows :



Iteration   Step         Time
1           Generation   0:00:38
            Fetcher      0:00:45
            Parse        0:00:26
            Update       0:00:24
2           Generation   0:00:40
            Fetcher      0:23:26
            Parse        0:01:53
            Update       0:00:30
3           Generation   0:00:52
            Fetcher      0:55:46
            Parse        0:08:24
            Update       0:01:07
4           Generation   0:01:15
            Fetcher      1:08:36
            Parse        0:19:00
            Update       0:02:01
5           Generation   0:02:03
            Fetcher      2:14:20
            Parse        0:47:59
            Update       0:04:34
6           Generation   0:03:55
            Fetcher      4:20:02
            Parse        1:30:58
            Update       0:08:46
7           Generation   0:06:44
            Fetcher      3:52:36
            Parse        1:09:40
            Update       0:08:15
8           Generation   0:08:31
            Fetcher      3:48:35
            Parse        1:04:27
            Update       0:08:32
9           Generation   0:10:00
            Fetcher      3:51:57
            Parse        1:18:38
            Update       0:09:27
10          Generation   0:11:44
            Fetcher      3:32:35
            Parse        0:59:13
            Update       0:09:39
Total                    33:08:53

What you can observe is that the generate and update steps take an increasingly long time, as mentioned above.

The stats from the final update step were :


db_fetched        10,626,298
db_gone              686,834
db_redir_perm        123,087
db_redir_temp        217,191
db_unfetched      64,678,627


which gives us a total of 11,653,410 URLs processed (fetched + gone + redirects) in a total time of 1,930 minutes.



On average, Nutch fetched 6,038 URLs per minute.

The graph below shows the bandwidth usage of the server when the Nutch crawl was running.

Network graph of Nutch crawl

This is a good illustration of the batch nature of Nutch, where the fetching is only one part of the whole process.

Let's now see how StormCrawler fared in a similar situation.

StormCrawler

StormCrawler can use different backends for storing the status of the URLs (i.e. what the crawldb does in Nutch). For this benchmark, we used the Elasticsearch module of StormCrawler as it is the most commonly used. This means that we won't just be indexing the content of the webpages in Elasticsearch; we'll also be using it to store the status of the URLs, as well as displaying metrics about the crawl with Kibana.

We ran the crawl for over two and a half days and got the following values in the status index :



DISCOVERED     188,396,525
FETCHED         32,656,149
ERROR            2,901,502
REDIRECTION      2,050,757
FETCH_ERROR      1,335,437

which means a total of 38,943,845 webpages processed over 3,977 minutes, i.e. an average of 9,792.26 pages per minute.

The network graph looked like this:

Network graph of StormCrawler crawl

which, apart from an unexplained and possibly unrelated spike on Christmas day, shows a pretty steady use of the bandwidth. Whereas Nutch often peaked around the 50M mark, StormCrawler sits lower but remains constant.

The metrics stored in Elasticsearch and displayed with Kibana gave a similar impression:

StormCrawler metrics displayed with Kibana
Interestingly, the Storm UI indicated that the bottleneck of the pipeline was the update step, which is not unusual given the 'write-heavy' nature of StormCrawler.

Conclusion


This benchmark, as set out above, shows that StormCrawler is roughly 60% more efficient than Apache Nutch (9,792 vs 6,038 pages per minute). We also found StormCrawler to run more reliably than Nutch, but this could be due to a misconfiguration of Apache Hadoop on the test server. We had to omit the indexing step from the Nutch crawl script because of reliability issues, whereas the StormCrawler topology did index the documents successfully; had it been kept, it would have added to Nutch's processing time.



The main explanation lies in the design of the crawlers: Nutch achieves greater spikes in the fetching step but does not fetch continuously, as StormCrawler does. I had previously compared Nutch to a sumo and StormCrawler to a ninja, but it seems that the fable of the tortoise and the hare would be just as appropriate.


It is important to bear in mind that the raw performance of the crawlers is just one aspect of an overall comparison. One should also consider the frequency of releases and contributions, as well as more subjective aspects such as ease of use and versatility. There is also, of course, the question of the functionalities provided. To be fair to Nutch, it currently does things that StormCrawler does not yet support, such as document deduplication and scoring. On the other hand, StormCrawler has a few aces up its sleeve too, with XPath extraction, sitemap processing and live monitoring with Kibana.


As often said in similar situations: “your mileage might vary”. The figures given here depend on the particular seed list and hardware; you might get different results for your specific use case. Since the resources and configurations of the benchmark are publicly available, you can reproduce it and extend it as you wish.


Hopefully we’ll run more benchmarks in the future. These could cover larger scale crawling in fully distributed mode and/or compare different backends for StormCrawler (e.g. Redis+RabbitMQ vs Elasticsearch).


Happy crawling!





Monday 31 October 2016

What's new in StormCrawler 1.2

StormCrawler 1.2 has been released today after a busy and exciting month, the highlight of which was certainly the announcement by CommonCrawl of their news dataset, which is powered by StormCrawler. This helped raise the profile of the project and also brought various improvements to the WARC and Elasticsearch modules (see below). More good news was that my talk was accepted for ApacheCon BigData next month in Seville, which prompted a Q&A interview on Linux.com.

Back to the content of the release. There have been many improvements at various levels, the main one being that the WARC module was moved to the main repository [#313]. It has received many bugfixes and improvements since being used by CommonCrawl and is now stable enough to sit alongside the other external modules.

We recommend that all users upgrade their configuration to version 1.2 of StormCrawler.

Apart from minor bug fixes, the main changes in this new version are :


Core


  • Removes StatusStreamBolt [#341]
  • New Parse Filters :
    • MD5 signature [#354]
    • DomainParseFilter [#356]
  • URL Filters
    • URL Normalisation - remove parameters where the value is a 32-bit hash [#363]
    • Filtering : treat path parameters as query parameters [#366]
    • BasicURLFilter to remove URLs based on path repetition and max length [#368]
  • Add metadata.discoveryDate field to enable tracking discovery rate [#360]
  • Add super class for bolts using the status stream [#353]
  • JSoup: handle redirections via meta tag [#350]

Tika

  • Upgraded to Tika 1.13 [#285]
  • Combine JSoupParser with Tika [#357]
  • Tika parser can now parse embedded documents [#358]
Elasticsearch

  • Elasticsearch upgraded to 2.4.1
  • Metadata keys with multiple values not indexed correctly in ES [#345]
  • Refactoring into AbstractSpout for ES [#348]
  • Status index - fields stored unnecessarily [#351]
  • Cache URLs post ack/fail [#349]
The latter is a substantial change to the way the Elasticsearch spouts work. All 3 flavours of spouts hold a cache of the URLs being processed and use it to make sure that any URLs returned by a query are not added twice. This worked OK but did not cater for situations where a URL was towards the bottom of the buffer and got acked/failed not long before the buffer was refilled from ES. In such cases, the changes to the status index had not had time to be committed to the underlying index and, as a result, the same URL was returned by the next query. This led to 10 to 15% of URLs being unnecessarily re-fetched within a short delay. What #349 does is keep URLs in the cache for an extra N seconds after they have been acked or failed, to give time for the changes to be reflected in the search results (the delay is of course configurable via es.status.ttl.purgatory, as illustrated below).
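
For instance, a value along the lines of the sketch below would keep acked/failed URLs in the cache for an extra 30 seconds; the figure is purely illustrative and not necessarily the default:

  es.status.ttl.purgatory: 30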

Coming next?

The releases seem to come more and more frequently. It is not yet clear what the next one will have in store, but I am sure the discussions at ApacheCon as well as the constant stream of new users will lead to new functionalities and bugfixes.

In the meantime and as usual, thanks to all contributors and users and happy crawling!


Monday 19 September 2016

What's new in StormCrawler 1.1

The 1.1 release comes 2 months after the previous one and is relatively lightweight by comparison. The main changes are :


Dependency upgrades

Jackson Databind (2.6.6) and Apache Storm (1.0.2)

Core

  • HTTP protocol : store response headers verbatim in metadata (#317) - used by the WARC module
  • FetcherBolt - added option to throttle based on number of URLs in queues (#311)
  • Conventional 'never-refetch' Date for nextFetchDate (#331)
  • Added metadata.lastProcessedDate
  • Deprecated StatusStreamBolt and copied as DummyIndexer

Elasticsearch

  • Bugfix Flush BulkProcessor before closing connection (#320)
  • Added per day / month metrics consumer (#327)

Archetype

  • Generate real jar name in README 
  • Proper handling of user-provided package names (#326)
  • POM for projects from archetypes should use Java 8 (#325)

There have been several minor changes as well. 

Remember that you can get regular updates about major commits on the project by following us on Twitter @stormcrawlerapi.

BTW there should be an exciting announcement in the next couple of weeks about a cool use of StormCrawler by a high-profile user. Watch this space!

As usual, thanks to all contributors and users and happy crawling!

PS: If you are near Bristol, you might be interested in coming to the talk I'll be giving at Bristech on October 6th.

NOTE

A patch release, 1.1.1, was published on 21st September and includes #335 (thanks to Jeff Bolle for pointing it out).





Friday 9 September 2016

Index the web with StormCrawler (revisited)


I hope you all had an enjoyable summer. I can't believe it's not even a year since I published the (relatively popular) post on Index the web with AWS CloudSearch! At the time we had just released version 0.6 of StormCrawler and the post explained how to use SC to crawl a website and index it with CloudSearch. The tutorial also covered the same operations with Apache Nutch and helped users understand how the two projects differ.

StormCrawler has evolved a lot in just one year! I won't go into much detail, as this is explained in the previous posts, but we are now at version 1.0 and have a proper website for the project (albeit in constant need of improvement), a logo and many new resources, including a Maven archetype. The latest of these new resources is a module for the popular Redis data structure store.

Last year's post is now quite outdated as a result, so we'll revisit the same use case (crawling http://www.tescobank.com/), but this time using Redis to store the crawl frontier and URL information and bootstrapping the project with the Maven archetype. This time, we won't index the content with CloudSearch (or anything else), to keep the configuration to a minimum.

If you are new to StormCrawler, please read the CloudSearch post for an introduction, as well as the material on the website. Please bear in mind that although this short tutorial covers a single website processed on a single machine, StormCrawler is distributed by nature (thanks to Apache Storm) and can run on a cluster to deal with millions of pages.

Prerequisites

The instructions below are based on a Linux distribution. You will need to install the following software: Redis, Apache Storm and Apache Maven (with a Java 8 JDK).

The Redis module is currently a PR with the code stored in a separate branch. This might be merged based on user feedback and be available in the next release, but until then you can download the code for the redis branch or clone it with Git. Unzip the archive and from its root dir call 'mvn clean install'; this will put all the necessary jars in your local repository.

The storm command must be on your PATH; you don't need its servers to be running if you only want to run the crawl in local (non-distributed) mode.

Redis

Apache Storm as a framework is 'source agnostic' (if such a term exists): all it expects is that the Spout implementations provide the topologies with a steady stream of tuples. In the case of StormCrawler, these tuples are of the form <URL, Metadata>. Depending on the use case, they might come from a distributed queue (e.g. RabbitMQ), a database (MySQL) or a search system (Elasticsearch, SOLR).

The choice of tool to use depends on the following factors :
  • do you follow outlinks?
  • if so, is the crawl recursive i.e. can you get to the same URL via different pages?
  • do you need to revisit the pages?
StormCrawler provides a number of resources in its external plugins and so does Apache Storm itself.

Often the same data structure is used both to persist the information we have about the URLs and to queue the URLs to be fetched (the crawl frontier). With a key / (structured) value store like Redis we can use a slightly different strategy and separate the crawl frontier from the status of the URLs. For the frontier, we use keys with the prefix 'q_' followed by the host or domain name, with a list of URLs to fetch as the value. The Spout iterates over the queue entries and removes the head of each queue to send the URLs as tuples into the topology. This has the advantage of guaranteeing a perfect diversity of URLs in the topology and hence optimal performance.

We also store the information about the URLs under keys with the prefix 's_' followed by the URL; the value associated with such keys is a String containing the status and metadata of the URL. When discovering new URLs in recursive crawls, we can check whether a URL is already known, in which case we won't add it to the queues again.

One limitation of our Redis spout and updater is that we can't reschedule URLs to revisit them, but for many use cases this is absolutely fine.

Let's get started. With the Redis server running, open a client session with redis-cli and type

FLUSHALL
RPUSH q_www.tescobank.com http://www.tescobank.com/

We'll skip the creation of the corresponding s_http://www.tescobank.com/ entry out of pure laziness. It will get created once the URL is fetched and its status updated. What we just did is create a queue for the host www.tescobank.com whose value is a list containing a single URL to fetch.

Bootstrap a project with an archetype

Instead of having to build everything from scratch we'll use our Maven archetype to bootstrap our crawl project. From anywhere you want on your filesystem do :

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler \
  -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.1-SNAPSHOT \
  -DgroupId=net.stormcrawler -DartifactId=redis-crawler -Dversion=1.0

and press enter to confirm. Change directory to redis-crawler; you should see a basic set of configuration and resource files.

├── crawler-conf.yaml
├── crawler.flux
├── pom.xml
├── README.md
└── src
    └── main
        ├── java
        │   └── net
        │       └── stormcrawler
        │           └── CrawlTopology.java
        └── resources
            ├── default-regex-filters.txt
            ├── default-regex-normalizers.xml
            ├── parsefilters.json
            └── urlfilters.json

This will be the starting point for our modifications.

Customisation of resources

Edit the pom.xml file and add the redis module to the list of dependencies

<dependency>
  <groupId>com.digitalpebble.stormcrawler</groupId>
  <artifactId>storm-crawler-redis</artifactId>
  <version>1.1-SNAPSHOT</version>
</dependency>

Next, we'll edit crawler-conf.yaml and add the following configuration parameters :

http.content.limit: -1
fetcher.server.delay: 2.0
redis.status.max.urls.per.bucket: 5

Let's now edit urlfilters.json by setting ignoreOutsideHost to true and adding

{ "class": "com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter", "name": "RobotsFilter", "params": { } }

to the list of filters (don't forget to add a comma before this section!).

Next, add https to the second line of default-regex-filters.txt, and finally set the content of default-regex-normalizers.xml to

<?xml version="1.0"?>
<regex-normalize>
<!-- removes parameters from URL -->
<regex>
  <pattern>\?.+</pattern>
  <substitution></substitution>
</regex>
</regex-normalize>

Everything should now be similar to the configuration of last year's tutorial.

Customisation of the topology class

The example topology has a simple memory based Spout which reads preset URLs and a dummy StatusUpdater. What we need to do instead is to use the Redis-based equivalents.

Replace the spout declaration in CrawlTopology.java (line 49) with

builder.setSpout("spout", new com.digitalpebble.stormcrawler.redis.RedisSpout());

and StdOutStatusUpdater() (line 67) with com.digitalpebble.stormcrawler.redis.StatusUpdaterBolt().

For good measure let's get rid of StdOutIndexer() and use com.digitalpebble.stormcrawler.indexing.DummyIndexer() instead.

Run the crawl

First, let's build an uber-jar with 

mvn clean package

Then as mentioned in the README.md we can start the crawl with :

storm jar target/redis-crawler-1.0.jar net.stormcrawler.CrawlTopology -conf crawler-conf.yaml -local

or without '-local' if Storm is running as a (pseudo?) distributed cluster and we want the benefits of the UI, proper logging etc...

You should see the usual log entries and metrics info scrolling on the console. For a better understanding of what's going on, open a console and type

redis-cli KEYS 's_*' | sort

to see all the URLs discovered during the crawl, regardless of whether they were fetched or not.

To see the entire content of the fetch queue, do :

redis-cli LRANGE q_www.tescobank.com 0 -1

whereas

redis-cli LLEN q_www.tescobank.com

returns its size only. When it returns 0, your crawl is finished and you can kill the process with CTRL-C, or with 'storm kill' if you are running in distributed mode.

Conclusion

Looking back at last year's post, I realised that StormCrawler has evolved a lot since then and that the previous instructions were no longer quite up to date. The archetype, for instance, is a good way of getting started with StormCrawler and provides a solid starting point. There are also a lot more useful resources that users can leverage, including our brand new Redis components.

The approach we used for Redis, where two different sets of keys are used for the crawl frontier and the URL statuses, could be reused for other key-value stores like HBase, which do not necessarily have the secondary indices needed to shard the URLs per host and guarantee a good diversity of URLs in the topology.

Please do get involved and help review the PR for Redis. If you have any questions or problems, see http://stormcrawler.net/support/.

Happy crawling!

PS: Exercise for the reader

You probably noticed the 'crawler.flux' file in the directory generated by the archetype. Flux is a recent addition to Apache Storm : instead of defining a topology via a Java class as we did above, Flux allows you to define the components of the topology and their interactions via a YAML file, in a language-neutral way. Better still, it means that you don't need to recompile the code if you change something in your topology (you'll still need to stop and restart it though).

The crawler.flux file corresponds to the default topology, i.e. the one we started from before modifying it to use Redis. As stated in the README, you can start a topology in the following way :

storm jar target/redis-crawler-1.0.jar org.apache.storm.flux.Flux --local crawler.flux

As an exercise, why don't you have a look at the Flux file and modify it so that it runs the Redis-based topology?
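
If you want a head start, the sketch below shows what the spout and status updater definitions might look like in Flux; it assumes the Redis components take no constructor arguments (as in the Java example above) and that the Flux file uses the ids "spout" and "status"; adjust to whatever crawler.flux actually contains, and leave the other bolts and streams as generated:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.redis.RedisSpout"
    parallelism: 1

bolts:
  # keep the other bolts generated by the archetype and point the status
  # updater at the Redis implementation
  - id: "status"
    className: "com.digitalpebble.stormcrawler.redis.StatusUpdaterBolt"
    parallelism: 1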

PPS: Exercise for the reader #2

Crawling is very nice, but unless you store or index the documents you crawl it remains a pretty pointless exercise. Why don't you modify the crawl above so that it sends the data to Elasticsearch, SOLR or CloudSearch? If you are feeling adventurous, you could also try the WARC module and generate some great web archives to play with.