DigitalPebble's Blog

Monday, 31 October 2016

What's new in StormCrawler 1.2

StormCrawler 1.2 has been released today after a busy and exciting month, the highlight of which was certainly the announcement by CommonCrawl of the news dataset, which is powered by StormCrawler. This helped raise the profile of the project and also brought various improvements to the WARC and Elasticsearch modules (see below). Another great news was that my talk got accepted for ApacheCon BigData next month in Seville which prompted a Q&A interview on Linux.com.

Back to the content of the release. There have been many improvements on various levels, the main one being that the WARC module was moved to the main repository [#313]. It got many bugfixes and improvements since used by CommonCrawl and is now stable enough to join the other external modules.

We recommend all users to upgrade their configuration to the version 1.2 of StormCrawler.

Apart from minor bug fixes, the main changes in this new version are :

Core

Removes StatusStreamBolt [#341]

New Parse Filters :

MD5 signature [#354]
DomainParseFilter [#356]

URL Filters

URL Normalisation - remove parameters where the value is a 32-bit hash [#363]
Filtering : treat path parameters as query parameters [#366]
BasicURLFilter to remove URLs based on path repetition and max length [#368]

Add metadata.discoveryDate field to enable tracking discovery rate [#360]

Add super class for bolts using the status stream [#353]

JSoup Handle redirections via meta tag [#350]

Tika

Upgraded to Tika 1.13 [#285]

Combine JSoupParser with Tika [#357]

Tika parser can now parse embedded documents [#358]

Elasticsearch

Elasticsearch upgraded to 2.4.1

Metadata keys with multiple values not indexed correctly in ES [#345]

Refactoring into AbstractSpout for ES [#348]

Status index - fields stored unnecessarily [#351]

Cache URLs post ack/fail [#349]

The latter is a substantial change to the way the Elasticsearch spouts work. All 3 flavours of spouts hold a cache of the URLs being processed and use it to make sure that any URLs returned by a query are not added twice. This worked OK but did not cater for situations where a URL was towards the bottom of the buffer and acked/failed not long before the buffer was refilled from ES. In such cases, the changes to the status index hadn't had the time to be committed to the underlying index and as a result, the same URL was returned in the next query. This resulted in 10 to 15% of URLs being unnecessarily re-fetched in a short delay. What #349 does is that after ack/failing, URLs are kept in the cache for an extra N-seconds, to give time for the changed to be reflected in the search results (this is of course configurable via es.status.ttl.purgatory).

Coming next?

The releases seem to come more and more frequently. It is not sure yet what the next one will have in store but I am sure the discussions at ApacheCon as well as constant stream of new users will provide new functionalities and bugfixes.

In the meantime and as usual, thanks to all contributors and users and happy crawling!

Monday, 19 September 2016

What's new in StormCrawler 1.1

The 1.1 release comes 2 months after the previous one and is relatively lightweight by comparison. The main changes are :

Dependency upgrades

Jackson Databind (2.6.6) and Apache Storm (1.0.2)

Core

HTTP protocol : store response headers verbatim in metadata (#317) - used by the WARC module
FetcherBolt - added option to throttle based on number of URLs in queues (#311)
Conventional 'never-refetch' Date for nextFetchDate (#331)
Added metadata.lastProcessedDate
Deprecated StatusStreamBolt and copied as DummyIndexer

Elasticsearch

Bugfix Flush BulkProcessor before closing connection (#320)
Added per day / month metrics consumer (#327)

Archetype

Generate real jar name in README
Proper handling of user-provided package names (#326)
POM for projects from archetypes should use Java 8 (#325)

There have been several minor changes as well.

Remember that you can get regular updates about major commits on the project by following us on Twitter @stormcrawlerapi..

BTW there should be an exciting announcement in the next couple of weeks about a cool use of StormCrawler by a high-profile user, watch this space!

As usual, thanks to all contributors and users and happy crawling!

PS: If you are near the Bristol, you might be interested in coming to the talk I'll be giving at Bristech on the Oct 6th.

NOTE

A patch release 1.1.1 has been published on the 21st Sept and includes #335 (thanks to Jeff Bolle for pointing it out).

Friday, 9 September 2016

Index the web with StormCrawler (revisited)

I hope you all had an enjoyable summer. I can't believe it's not even a year since I published the (relatively popular) post on Index the web with AWS CloudSearch! At the time we had just released the version 0.6 of StormCrawler and the post explained how to use SC to crawl a website and index it with CloudSearch. The tutorial also covered the same operations with Apache Nutch and helped users understand how the two projects differ.

StormCrawler has evolved a lot in just one year! In won't go into much details as this is explained in the previous posts but we are now at version 1.0, have a proper website for the project (albeit in constant need of improvements), a logo, many new resources, including a Maven archetype. The latest of these new resources is a new module for the popular Redis data structure store.

Last year's post is now quite outdated as a result so we'll now revisit the same use case (crawling http://www.tescobank.com/) but this time using Redis to store the crawl frontier and URL information and bootstrapping the project with the Maven archetype. This time, we won't index its content with Cloudsearch (or anything else) to keep the configuration to a minimum.

If you are new to StormCrawler, please read the Cloudsearch post for an introduction as well as the material on the website. Please bear in mind that although this short tutorial covers a single website processed with a single machine, StormCrawler is distributed by nature (thanks to Apache Storm) and can run on a cluster to deal with millions of pages.

Prerequisites

The instructions below are based on a Linux distribution. You will need to install the following software :

Java 8
Apache Maven (http://maven.apache.org/)
Redis 2.8.1 (http://redis.io/)
Apache Storm 1.0.1 (http://storm.apache.org/)

The Redis module is currently a PR with the code stored in a separate branch. This might be merged based on user feedback and be available in the next release but until then you can download the code for the redis branch or clone it with Git. Unzip the archive and from its root dir call 'mvn clean install', this will put all the necessary jars in your local repository.

The Storm command must be on your PATH, you don't need its servers to be running if you only want to run the crawls in local (non-distributed) mode.

Redis

Apache Storm as a framework is 'source agnostic' (if such a term exists), all it expects is that the Spout implementations provide the topologies with a steady stream of tuples. In the case of StormCrawler, these tuples are of the form <URL, Metadata>. Depending on the use case, they might come from a distributed queue (e.g. RabbitMQ), a database (MySQL) or a search system (Elasticsearch, SOLR).

The choice of tool to use depends on the following factors :

do you follow outlinks?
if so, is the crawl recursive i.e. can you get to the same URL via different pages?
do you need to revisit the pages?

StormCrawler provides a number of resources in its external plugins and so does Apache Storm itself.

Often the same data structure is used to both persist the information we have about the URLs and queue the URLs to be fetched (crawl frontier). With a key / (structured) value store like Redis we can use a slightly different strategy and separate the crawl frontier from the status of the URLs. For the frontier, we use keys with the prefix 'q_' followed by the host or domain name with a List of URLs to fetch as value. The Spout iterates on the queue entries and removes the head of the queue to send them as Tuples in the topology. This has the advantage of guaranteeing a perfect diversity of URLs in the topology and hence optimal performance. We also the information about the URLs with the prefix 's_' followed by the URL, the value associated with such keys is a String containing the status and metadata of the URLs. When discovering new URLs in recursive crawls, we can check whether the URL is already known, in which case we won't add it to the queues again.

One limitation of our Redis spout and updater is that we can't reschedule URLs for revisiting them but for many use cases, this is absolutely fine.

Let's get started. With the Redis server running, open a client session with redis-cli and type

FLUSHALL RPUSH q_www.tescobank.com http://www.tescobank.com/

We'll skip the creation of the corresponding s_http://www.tescobank.com/ entry out of pure laziness. It will get created once the URL is fetched and its status updated. What we just did is that we created a queue for the host tescobank.com with a list as value which contains a single URL to fetch.

Boostrap a project with an archetype

Instead of having to build everything from scratch we'll use our Maven archetype to bootstrap our crawl project. From anywhere you want on your filesystem do :

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.1-SNAPSHOT -DgroupId=net.stormcrawler -DartifactId=redis-crawler -Dversion=1.0

and press enter to confirm. Change the directory to redis-crawler, you should see a basic set of config and resource files.

├── crawler-conf.yaml ├── crawler.flux ├── pom.xml ├── README.md └── src └── main ├── java │ └── net │ └── stormcrawler │ └── CrawlTopology.java └── resources ├── default-regex-filters.txt ├── default-regex-normalizers.xml ├── parsefilters.json └── urlfilters.json

This will be the starting point for our modifications.

Customisation of resources

Edit the pom.xml file and add the redis module to the list of dependencies

<dependency> <groupId>com.digitalpebble.stormcrawler</groupId> <artifactId>storm-crawler-redis</artifactId> <version>1.1-SNAPSHOT</version> </dependency>

Next, we'll edit crawler-conf.yaml and add the following configuration parameters :

http.content.limit: -1

fetcher.server.delay: 2.0

redis.status.max.urls.per.bucket: 5

Let's now edit urlfilters.json by setting ignoreOutsideHost to true and adding

{ "class": "com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter", "name": "RobotsFilter", "params": { } }

to the list of filters (don't forget to add a comma before this section!).

Next add https to the second line of regex-filter.txt, and finally set the content of regex-normalizers.xml to

<?xml version="1.0"?>

<regex-normalize>

<regex>

<pattern>\?.+</pattern>

<substitution></substitution>

</regex>

</regex-normalize>

Everything should now be similar to the configuration of last year's tutorial.

Customisation of the topology class

The example topology has a simple memory based Spout which reads preset URLs and a dummy StatusUpdater. What we need to do instead is to use the Redis-based equivalents.

Replace the spout declaration in CrawlTopology.java (line 49) with

builder.setSpout("spout", new com.digitalpebble.stormcrawler.redis.RedisSpout());

and StdOutStatusUpdater() (line 67) with com.digitalpebble.stormcrawler.redis.StatusUpdaterBolt().

For good measure let's get rid of StdOutIndexer() and use com.digitalpebble.stormcrawler.indexing.DummyIndexer() instead.

Run the crawl

First, let's build an uber-jar with

mvn clean package

Then as mentioned in the README.md we can start the crawl with :

storm jar target/redis-crawler-1.0.jar net.stormcrawler.CrawlTopology -conf crawler-conf.yaml -local

or without '-local' if Storm is running as a (pseudo?) distributed cluster and we want the benefits of the UI, proper logging etc...

You should see the usual log entries and metrics info scrolling on the console. For a better understanding of what's going on, open a console and type

redis-cli KEYS s_* | sort

to see all the URLs discovered during the crawl, regardless of whether they were fetched or not.

To see the entire content of the fetch queue, do :

redis-cli LRANGE q_www.tescobank.com 0 -1

whereas

redis-cli LLEN q_www.tescobank.com

returns its size only. When it returns 0, your crawl is finished and you can kill the process with CTRL-C or do it with STORM KILL if you are running in distributed mode.

Conclusion

Looking back at last year's post, I realised that StormCrawler has evolved a lot since and the previous instructions were not quite up to date. The archetype, for instance, is a good way of getting started with StormCrawler and provides a solid starting point. There are also a lot more useful resources that users can leverage, including our brand new Redis components.

The approach we used for Redis where 2 different sets of keys are used for the crawl frontier and the URLs status could be reused for other key value stores like HBase which do not necessarily have secondary indices that we can use for sharding the URLs per host and guarantee a good diversity of URLs in the topology.

Please do get involved and help reviewing the PR for Redis. If you have any questions or problems, see http://stormcrawler.net/support/.

Happy crawling!

PS: Exercise for the reader

You probably noticed the 'crawler.flux' file in the directory generated by the artefact. Flux is a recent addition to Apache Storm : instead of defining a topology via a Java class as we did above, Flux allows you to do define the components of the topology and their interactions via a yaml file in a language-neutral way. Better, it means that you don't need to recompile the code if you change something in your topology (you'll still need to turn it off and restart it though).

The crawler.flux file corresponds to the default topology which we modified above to use Redis. As stated in the README, you can start a topology in the following way :

storm jar target/redis-crawler-1.0.jar org.apache.storm.flux.Flux --local crawler.flux

As an exercise, why don't you have a look at the Flux file and modify it so that it runs the Redis-based topology?

PPS: Exercise for the reader #2

Crawling is very nice but unless you store or index the documents you crawl it remains a pretty pointless exercise. Why don't you modify the crawl above so that it sends the data to Elasticsearch, SOLR or Cloudsearch? If you are feeling adventurous you could also try the WARC module and generate some great web archives to play with.

Tuesday, 12 July 2016

StormCrawler : the Coming of Age

I am very happy to announce the release of StormCrawler 1.0. It has taken a few years (and more specifically 791 commits from 15 contributors and 10 releases) to evolve from what was just an intuition to a piece of software which is now mature, stable and used in production by various companies.

The major release number reflects the version of Apache Storm, as we switched from Storm 0.10 to 1.0, however our minor number will not necessarily track the one used in Storm. The move to 1.0 also reflects the maturity of StormCrawler.

The main changes compared to the previous release are :

Moved to Storm 1.x (#295)
Upgrade to Java 8 (#308)
Renamed packages storm.crawler into stormcrawler (#306)
Added Flux equivalent to the example topology class (#286)
FetcherBolt simplify access to OutputCollector (#278)
JSoupParser detects mimetype with Tika #303
Elasticsearch : remote TTL from metrics index (#296)
Elasticsearch : sampler aggregation spout (#305)
Tika : Provide clues to Tika parser for indentification of mimetype (#302)
Use metadata keys last-modified and etag (#109)
URLFilter based on metadata (#312)

plus several minor changes and bug fixes.

Let's have a closer look at some of the changes above.

Flux

Flux is a very elegant resource for defining and deploying topologies on Apache Storm. The simple crawl topology generated by the archetype now contains a Flux equivalent of the Java topology class. This means that you don't need to know Java to define a topology but also that you don't need to recompile the jar every time you make a small change to the topology.

After calling 'mvn clean package' , you can start the topology in local mode with

storm jar target/<INSERTJARNAMEHERE>.jar  org.apache.storm.flux.Flux --local crawler.flux

Sampler aggregation spout

We added a new type of spout to the Elasticsearch module which uses the sampler aggregation - a new feature in Elasticsearch 2.x. This spout is useful for cases where the status index is very large as it reduces the time taken by the queries while preserving the diversity of URLs.

URL filter based on metadata

A new configurable URL filter based on metadata had been added and is included in the default topology generated by the archetype. This filter is ridiculously simple : it removes any outlinks based on the metadata of the source document. Imagine for instance that we get URLs from sitemaps files for a given site. We could decide not to follow the outlinks found in the leafs documents from the sitemap, which is a reasonable thing to do : if a site tells you what to index, there is a possibility that you'd only get noise / variants / duplicates by following the outlinks. Since leaf documents get the feature isSitemap with the value of false, we can configure the URL filter as follows :


	{ "class": "com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter",
	"name": "MetadataFilter",
	"params": {
	"isSitemap": "false"
	}
	}

This mechanism can be used for other things of course.

What next?

The next release will probably contain code and resources for fetching with the Selenium protocol or JbrowserDriver (#144). We might also improve the WARC related code. As usual the project evolves with the needs and contributions of the community.

Thanks

Since StormCrawler just passed a major milestone, it is a good time to thank all the committers, contributors past and present and users for helping make the project what it is today. I've had some very positive feedback recently from new users and I hope some of you will take the time to share their experiences with the rest of the community.

Happy crawling!

Julien

Friday, 10 June 2016

What's new in StormCrawler 0.10?

The version 0.10 of Storm-Crawler has just been released. It contains many improvements and bugfixes and we recommend all existing users to upgrade to it.

Here are the main changes :

Core

Apart from the usual dependency upgrades (Apache Tika 1.12, Jsoup 1.8.3) and various bugfixes (notably #280 and 293), we completely removed the configuration files (parse and URL filters) from the core module (#227) as these should be specified by the user or provided by the archetype. This fixes an old issue we were having with user files being possibly overwritten with the ones provided by the core jar when generating the big jar.

We also added a mechanism to allow custom scheduling of the URLs based on metadata (#283). This applies to fetched documents only for the time being and can be used to revisit some pages more frequently depending on their nature, for instance news feeds (see below).

It is now possible to configure a list of metadata to persist (metadata.persist) into the status storage but not transfer to the outlinks of a document (#293). This is useful for the _redirTo values that we now generate to track the target of redirections (#96). Until now this would have been passed on to the outlinks, which would have been wrong.

The JSoupParser now has the option of passing on documents which it can't handle to the default stream so that another bolt can try to deal with them (#266). This can be useful e.g. to chain the JSoup parser - which generates a good DOM for XPath extraction, with the Tika one which gives rubbish DOM but can handle all sorts of file formats. JSoupParser also generates more complete DOMs, including inline Javascript and css nodes (#219).

New resources

LinkParseFilter takes Xpath expressions in the parsefilter configuration and allows to add the matching elements to the outlinks.

FeedParserBolt uses the ROME library to process news feeds and generate the outlinks accordingly.

There is a separate repository for resources to generate WARC files. These might be moved to the storm-crawler repository later on.

Elasticsearch

There has been loads of work done in this module. The main thing is that we upgraded the code to the current version 2.3.1 of Elasticsearch (#275), which should provide better performance and new functionalities. The Kibana dashboards have also been improved and can display generic metrics for Apache Storm (receive queues, memory heaps), top hosts/domains on status dashboard (see below) and total bytes fetched on the crawl metrics dashboard.

Kibana dashboard showing the status counts and top hostnames

What's next?

There is already a branch to move the code to Storm 1.x and some related improvements (#278). We will keep improving the existing components, in particular add sampling to the AggregationSpout for Elasticsearch.

Thanks to all users and contributors who helped with this release. Remember that you can follow the project on Twitter @stormcrawlerapi.

Wednesday, 16 March 2016

What's new in Storm-Crawler 0.9

The version 0.9 of Storm-Crawler has just been released. It contains many improvements and bugfixes and we recommend all existing users to upgrade to it.

Here are the main changes :

Core

Moved to Storm 0.10.0 #229
FetcherBolt can dump content of its queues to log #45
PluggableSchedulers #245
Bugfix HTTP protocol setConnectionRequestTimeout
Sitemap : option filter out URLs older than certain threshold #249
New URLFilter : remove links to self #252
HTTP protocol can limit amount of content fetched #206
AbstractStatusUpdaterBolt allows proper acking of tuples #241
Improvements to URL normalization #264 #205 #120
Improvements to Robots caching #265
Fetcher to dump the content of its queues to the log #45

Elasticsearch

Spout : one instance per shard #198
AggregationSpout #237
ES-based Spouts use a static Client instance #258
ES use NodeClient only if no address is specified #260
StatusUpdaterBolts must be able to ack/fail explicitly #216

The AggregationSpout in particular is a very useful feature. The existing ElasticSearchSpout offers very few guarantees regarding the diversity of hosts retrieved by the queries. This is improved by randomizing the results, however the latter has an impact on the memory used by ES for the field caching. Limiting the field caching leads to poor performance and the eviction mechanism slows the queries quite a lot. The AggregationSpout on the other hand guarantees a good diversity of URLs by bucketing the search results per hostname (or TLD or IP depending on the value of es.status.routing.fieldname). It is likely to get improved further once we move to Elasticsearch 2.x (see below).

Both ES spout implementations benefit from the sharding mechanism introduced in #198 : if the configuration specifies es.status.routing: true, the StatusUpdaterBolt will direct the URLs to specific shards based on the value of partition.url.mode, i.e. all the URLs for a particular host or TLD will be colocated on the same Elasticsearch shard. This means that the size of the shard can become uneven, depending on the distribution of URLs in your crawl but another implication is that it is now possible to have one Spout instance per shard and parallelise the reads from ES while preserving politeness.

Finally you should see a noticeable improvement in performance now that the StatusUpdaterBolt acks/fails the tuples explicitly #216. Prior to that all tuples were automatically acked as soon as they were buffered for indexing to ES i.e. they could get acked in the spout and removed from its internal cache even though the updates were not yet committed to the ES index. This meant that the spout could resend the same URL down the topology after querying ES for new URLs. Obviously quite wasteful but now luckily fixed!

What's next?

We should see plenty of further improvements in the next months, in particular an upgrade to Apache Storm 1.0 (which is due any time soon) and also a move to Elasticsearch 2.x (#257). Thanks to all users and contributors. Happy crawling!