DigitalPebble's Blog

Friday, 9 September 2016

Index the web with StormCrawler (revisited)

I hope you all had an enjoyable summer. I can't believe it's not even a year since I published the (relatively popular) post on Index the web with AWS CloudSearch! At the time we had just released the version 0.6 of StormCrawler and the post explained how to use SC to crawl a website and index it with CloudSearch. The tutorial also covered the same operations with Apache Nutch and helped users understand how the two projects differ.

StormCrawler has evolved a lot in just one year! In won't go into much details as this is explained in the previous posts but we are now at version 1.0, have a proper website for the project (albeit in constant need of improvements), a logo, many new resources, including a Maven archetype. The latest of these new resources is a new module for the popular Redis data structure store.

Last year's post is now quite outdated as a result so we'll now revisit the same use case (crawling http://www.tescobank.com/) but this time using Redis to store the crawl frontier and URL information and bootstrapping the project with the Maven archetype. This time, we won't index its content with Cloudsearch (or anything else) to keep the configuration to a minimum.

If you are new to StormCrawler, please read the Cloudsearch post for an introduction as well as the material on the website. Please bear in mind that although this short tutorial covers a single website processed with a single machine, StormCrawler is distributed by nature (thanks to Apache Storm) and can run on a cluster to deal with millions of pages.

Prerequisites

The instructions below are based on a Linux distribution. You will need to install the following software :

Java 8
Apache Maven (http://maven.apache.org/)
Redis 2.8.1 (http://redis.io/)
Apache Storm 1.0.1 (http://storm.apache.org/)

The Redis module is currently a PR with the code stored in a separate branch. This might be merged based on user feedback and be available in the next release but until then you can download the code for the redis branch or clone it with Git. Unzip the archive and from its root dir call 'mvn clean install', this will put all the necessary jars in your local repository.

The Storm command must be on your PATH, you don't need its servers to be running if you only want to run the crawls in local (non-distributed) mode.

Redis

Apache Storm as a framework is 'source agnostic' (if such a term exists), all it expects is that the Spout implementations provide the topologies with a steady stream of tuples. In the case of StormCrawler, these tuples are of the form <URL, Metadata>. Depending on the use case, they might come from a distributed queue (e.g. RabbitMQ), a database (MySQL) or a search system (Elasticsearch, SOLR).

The choice of tool to use depends on the following factors :

do you follow outlinks?
if so, is the crawl recursive i.e. can you get to the same URL via different pages?
do you need to revisit the pages?

StormCrawler provides a number of resources in its external plugins and so does Apache Storm itself.

Often the same data structure is used to both persist the information we have about the URLs and queue the URLs to be fetched (crawl frontier). With a key / (structured) value store like Redis we can use a slightly different strategy and separate the crawl frontier from the status of the URLs. For the frontier, we use keys with the prefix 'q_' followed by the host or domain name with a List of URLs to fetch as value. The Spout iterates on the queue entries and removes the head of the queue to send them as Tuples in the topology. This has the advantage of guaranteeing a perfect diversity of URLs in the topology and hence optimal performance. We also the information about the URLs with the prefix 's_' followed by the URL, the value associated with such keys is a String containing the status and metadata of the URLs. When discovering new URLs in recursive crawls, we can check whether the URL is already known, in which case we won't add it to the queues again.

One limitation of our Redis spout and updater is that we can't reschedule URLs for revisiting them but for many use cases, this is absolutely fine.

Let's get started. With the Redis server running, open a client session with redis-cli and type

FLUSHALL RPUSH q_www.tescobank.com http://www.tescobank.com/

We'll skip the creation of the corresponding s_http://www.tescobank.com/ entry out of pure laziness. It will get created once the URL is fetched and its status updated. What we just did is that we created a queue for the host tescobank.com with a list as value which contains a single URL to fetch.

Boostrap a project with an archetype

Instead of having to build everything from scratch we'll use our Maven archetype to bootstrap our crawl project. From anywhere you want on your filesystem do :

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.1-SNAPSHOT -DgroupId=net.stormcrawler -DartifactId=redis-crawler -Dversion=1.0

and press enter to confirm. Change the directory to redis-crawler, you should see a basic set of config and resource files.

├── crawler-conf.yaml ├── crawler.flux ├── pom.xml ├── README.md └── src └── main ├── java │ └── net │ └── stormcrawler │ └── CrawlTopology.java └── resources ├── default-regex-filters.txt ├── default-regex-normalizers.xml ├── parsefilters.json └── urlfilters.json

This will be the starting point for our modifications.

Customisation of resources

Edit the pom.xml file and add the redis module to the list of dependencies

<dependency> <groupId>com.digitalpebble.stormcrawler</groupId> <artifactId>storm-crawler-redis</artifactId> <version>1.1-SNAPSHOT</version> </dependency>

Next, we'll edit crawler-conf.yaml and add the following configuration parameters :

http.content.limit: -1

fetcher.server.delay: 2.0

redis.status.max.urls.per.bucket: 5

Let's now edit urlfilters.json by setting ignoreOutsideHost to true and adding

{ "class": "com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter", "name": "RobotsFilter", "params": { } }

to the list of filters (don't forget to add a comma before this section!).

Next add https to the second line of regex-filter.txt, and finally set the content of regex-normalizers.xml to

<?xml version="1.0"?>

<regex-normalize>

<regex>

<pattern>\?.+</pattern>

<substitution></substitution>

</regex>

</regex-normalize>

Everything should now be similar to the configuration of last year's tutorial.

Customisation of the topology class

The example topology has a simple memory based Spout which reads preset URLs and a dummy StatusUpdater. What we need to do instead is to use the Redis-based equivalents.

Replace the spout declaration in CrawlTopology.java (line 49) with

builder.setSpout("spout", new com.digitalpebble.stormcrawler.redis.RedisSpout());

and StdOutStatusUpdater() (line 67) with com.digitalpebble.stormcrawler.redis.StatusUpdaterBolt().

For good measure let's get rid of StdOutIndexer() and use com.digitalpebble.stormcrawler.indexing.DummyIndexer() instead.

Run the crawl

First, let's build an uber-jar with

mvn clean package

Then as mentioned in the README.md we can start the crawl with :

storm jar target/redis-crawler-1.0.jar net.stormcrawler.CrawlTopology -conf crawler-conf.yaml -local

or without '-local' if Storm is running as a (pseudo?) distributed cluster and we want the benefits of the UI, proper logging etc...

You should see the usual log entries and metrics info scrolling on the console. For a better understanding of what's going on, open a console and type

redis-cli KEYS s_* | sort

to see all the URLs discovered during the crawl, regardless of whether they were fetched or not.

To see the entire content of the fetch queue, do :

redis-cli LRANGE q_www.tescobank.com 0 -1

whereas

redis-cli LLEN q_www.tescobank.com

returns its size only. When it returns 0, your crawl is finished and you can kill the process with CTRL-C or do it with STORM KILL if you are running in distributed mode.

Conclusion

Looking back at last year's post, I realised that StormCrawler has evolved a lot since and the previous instructions were not quite up to date. The archetype, for instance, is a good way of getting started with StormCrawler and provides a solid starting point. There are also a lot more useful resources that users can leverage, including our brand new Redis components.

The approach we used for Redis where 2 different sets of keys are used for the crawl frontier and the URLs status could be reused for other key value stores like HBase which do not necessarily have secondary indices that we can use for sharding the URLs per host and guarantee a good diversity of URLs in the topology.

Please do get involved and help reviewing the PR for Redis. If you have any questions or problems, see http://stormcrawler.net/support/.

Happy crawling!

PS: Exercise for the reader

You probably noticed the 'crawler.flux' file in the directory generated by the artefact. Flux is a recent addition to Apache Storm : instead of defining a topology via a Java class as we did above, Flux allows you to do define the components of the topology and their interactions via a yaml file in a language-neutral way. Better, it means that you don't need to recompile the code if you change something in your topology (you'll still need to turn it off and restart it though).

The crawler.flux file corresponds to the default topology which we modified above to use Redis. As stated in the README, you can start a topology in the following way :

storm jar target/redis-crawler-1.0.jar org.apache.storm.flux.Flux --local crawler.flux

As an exercise, why don't you have a look at the Flux file and modify it so that it runs the Redis-based topology?

PPS: Exercise for the reader #2

Crawling is very nice but unless you store or index the documents you crawl it remains a pretty pointless exercise. Why don't you modify the crawl above so that it sends the data to Elasticsearch, SOLR or Cloudsearch? If you are feeling adventurous you could also try the WARC module and generate some great web archives to play with.

Tuesday, 12 July 2016

StormCrawler : the Coming of Age

I am very happy to announce the release of StormCrawler 1.0. It has taken a few years (and more specifically 791 commits from 15 contributors and 10 releases) to evolve from what was just an intuition to a piece of software which is now mature, stable and used in production by various companies.

The major release number reflects the version of Apache Storm, as we switched from Storm 0.10 to 1.0, however our minor number will not necessarily track the one used in Storm. The move to 1.0 also reflects the maturity of StormCrawler.

The main changes compared to the previous release are :

Moved to Storm 1.x (#295)
Upgrade to Java 8 (#308)
Renamed packages storm.crawler into stormcrawler (#306)
Added Flux equivalent to the example topology class (#286)
FetcherBolt simplify access to OutputCollector (#278)
JSoupParser detects mimetype with Tika #303
Elasticsearch : remote TTL from metrics index (#296)
Elasticsearch : sampler aggregation spout (#305)
Tika : Provide clues to Tika parser for indentification of mimetype (#302)
Use metadata keys last-modified and etag (#109)
URLFilter based on metadata (#312)

plus several minor changes and bug fixes.

Let's have a closer look at some of the changes above.

Flux

Flux is a very elegant resource for defining and deploying topologies on Apache Storm. The simple crawl topology generated by the archetype now contains a Flux equivalent of the Java topology class. This means that you don't need to know Java to define a topology but also that you don't need to recompile the jar every time you make a small change to the topology.

After calling 'mvn clean package' , you can start the topology in local mode with

storm jar target/<INSERTJARNAMEHERE>.jar  org.apache.storm.flux.Flux --local crawler.flux

Sampler aggregation spout

We added a new type of spout to the Elasticsearch module which uses the sampler aggregation - a new feature in Elasticsearch 2.x. This spout is useful for cases where the status index is very large as it reduces the time taken by the queries while preserving the diversity of URLs.

URL filter based on metadata

A new configurable URL filter based on metadata had been added and is included in the default topology generated by the archetype. This filter is ridiculously simple : it removes any outlinks based on the metadata of the source document. Imagine for instance that we get URLs from sitemaps files for a given site. We could decide not to follow the outlinks found in the leafs documents from the sitemap, which is a reasonable thing to do : if a site tells you what to index, there is a possibility that you'd only get noise / variants / duplicates by following the outlinks. Since leaf documents get the feature isSitemap with the value of false, we can configure the URL filter as follows :


	{ "class": "com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter",
	"name": "MetadataFilter",
	"params": {
	"isSitemap": "false"
	}
	}

This mechanism can be used for other things of course.

What next?

The next release will probably contain code and resources for fetching with the Selenium protocol or JbrowserDriver (#144). We might also improve the WARC related code. As usual the project evolves with the needs and contributions of the community.

Thanks

Since StormCrawler just passed a major milestone, it is a good time to thank all the committers, contributors past and present and users for helping make the project what it is today. I've had some very positive feedback recently from new users and I hope some of you will take the time to share their experiences with the rest of the community.

Happy crawling!

Julien

Friday, 10 June 2016

What's new in StormCrawler 0.10?

The version 0.10 of Storm-Crawler has just been released. It contains many improvements and bugfixes and we recommend all existing users to upgrade to it.

Here are the main changes :

Core

Apart from the usual dependency upgrades (Apache Tika 1.12, Jsoup 1.8.3) and various bugfixes (notably #280 and 293), we completely removed the configuration files (parse and URL filters) from the core module (#227) as these should be specified by the user or provided by the archetype. This fixes an old issue we were having with user files being possibly overwritten with the ones provided by the core jar when generating the big jar.

We also added a mechanism to allow custom scheduling of the URLs based on metadata (#283). This applies to fetched documents only for the time being and can be used to revisit some pages more frequently depending on their nature, for instance news feeds (see below).

It is now possible to configure a list of metadata to persist (metadata.persist) into the status storage but not transfer to the outlinks of a document (#293). This is useful for the _redirTo values that we now generate to track the target of redirections (#96). Until now this would have been passed on to the outlinks, which would have been wrong.

The JSoupParser now has the option of passing on documents which it can't handle to the default stream so that another bolt can try to deal with them (#266). This can be useful e.g. to chain the JSoup parser - which generates a good DOM for XPath extraction, with the Tika one which gives rubbish DOM but can handle all sorts of file formats. JSoupParser also generates more complete DOMs, including inline Javascript and css nodes (#219).

New resources

LinkParseFilter takes Xpath expressions in the parsefilter configuration and allows to add the matching elements to the outlinks.

FeedParserBolt uses the ROME library to process news feeds and generate the outlinks accordingly.

There is a separate repository for resources to generate WARC files. These might be moved to the storm-crawler repository later on.

Elasticsearch

There has been loads of work done in this module. The main thing is that we upgraded the code to the current version 2.3.1 of Elasticsearch (#275), which should provide better performance and new functionalities. The Kibana dashboards have also been improved and can display generic metrics for Apache Storm (receive queues, memory heaps), top hosts/domains on status dashboard (see below) and total bytes fetched on the crawl metrics dashboard.

Kibana dashboard showing the status counts and top hostnames

What's next?

There is already a branch to move the code to Storm 1.x and some related improvements (#278). We will keep improving the existing components, in particular add sampling to the AggregationSpout for Elasticsearch.

Thanks to all users and contributors who helped with this release. Remember that you can follow the project on Twitter @stormcrawlerapi.

Wednesday, 16 March 2016

What's new in Storm-Crawler 0.9

The version 0.9 of Storm-Crawler has just been released. It contains many improvements and bugfixes and we recommend all existing users to upgrade to it.

Here are the main changes :

Core

Moved to Storm 0.10.0 #229
FetcherBolt can dump content of its queues to log #45
PluggableSchedulers #245
Bugfix HTTP protocol setConnectionRequestTimeout
Sitemap : option filter out URLs older than certain threshold #249
New URLFilter : remove links to self #252
HTTP protocol can limit amount of content fetched #206
AbstractStatusUpdaterBolt allows proper acking of tuples #241
Improvements to URL normalization #264 #205 #120
Improvements to Robots caching #265
Fetcher to dump the content of its queues to the log #45

Elasticsearch

Spout : one instance per shard #198
AggregationSpout #237
ES-based Spouts use a static Client instance #258
ES use NodeClient only if no address is specified #260
StatusUpdaterBolts must be able to ack/fail explicitly #216

The AggregationSpout in particular is a very useful feature. The existing ElasticSearchSpout offers very few guarantees regarding the diversity of hosts retrieved by the queries. This is improved by randomizing the results, however the latter has an impact on the memory used by ES for the field caching. Limiting the field caching leads to poor performance and the eviction mechanism slows the queries quite a lot. The AggregationSpout on the other hand guarantees a good diversity of URLs by bucketing the search results per hostname (or TLD or IP depending on the value of es.status.routing.fieldname). It is likely to get improved further once we move to Elasticsearch 2.x (see below).

Both ES spout implementations benefit from the sharding mechanism introduced in #198 : if the configuration specifies es.status.routing: true, the StatusUpdaterBolt will direct the URLs to specific shards based on the value of partition.url.mode, i.e. all the URLs for a particular host or TLD will be colocated on the same Elasticsearch shard. This means that the size of the shard can become uneven, depending on the distribution of URLs in your crawl but another implication is that it is now possible to have one Spout instance per shard and parallelise the reads from ES while preserving politeness.

Finally you should see a noticeable improvement in performance now that the StatusUpdaterBolt acks/fails the tuples explicitly #216. Prior to that all tuples were automatically acked as soon as they were buffered for indexing to ES i.e. they could get acked in the spout and removed from its internal cache even though the updates were not yet committed to the ES index. This meant that the spout could resend the same URL down the topology after querying ES for new URLs. Obviously quite wasteful but now luckily fixed!

What's next?

We should see plenty of further improvements in the next months, in particular an upgrade to Apache Storm 1.0 (which is due any time soon) and also a move to Elasticsearch 2.x (#257). Thanks to all users and contributors. Happy crawling!

Friday, 8 January 2016

What's new in Storm-Crawler 0.8

There has been quite a bit happening with Storm-crawler recently. We got a proper logo (see above) and the project has a website at http://stormcrawler.net. We also just released the version 0.8, which contains the following changes :

the groupId for the Maven artefacts is now com.digitalpebble.stormcrawler (#218)
[ES] Deactivate _all field for status and metrics indices (#228)
[SOLR] Check that not more than one instance of the Spout exists (#213)
Upgraded to storm 0.9.6
Discover sitemap files automatically from robots.txt (#211)
Use Travis-CI to check commits and PRs
Replaced RandomURLSpout with MemorySpout + added MemoryStatusUpdater (#224)
Maven archetype (#225)

The latter is the main attraction for this new release. It allows users to bootstrap a new storm-crawler based project using the following command :

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=0.8

in interactive mode where users can then specify the groupId, artifactId, version and package name to use for their new project.

This results in a fully formed project, complete with a Maven pom file, the default CrawlTopology currently in the core module as well as a README, a crawler-conf.yaml and a set of resources (url and parse filters). This project can then be compiled and run in the usual ways.

Having this archetype should help new users to get a better understanding of how storm-crawler works and will also simplify the code if we remove the resources used in the archetype from the core module (see #227). This also illustrates nicely how lightweight stormcrawler is, most people will probably stop cloning the repository and just use the dependencies instead.

We should see plenty of further improvements in the next months, in particular an upgrade to Storm 0.10 (#229) and of course more content on our brand new website.

Thanks to all users and contributors. Happy crawling!

Wednesday, 4 November 2015

What's new in Storm-Crawler 0.7

Storm-Crawler 0.7 has been released yesterday. This release fixes some bugs and provides numerous improvements, we advise users to upgrade to it. Here are the main changes:

AbstractIndexingBolt to use status stream in declareOutputFields #190

Change Status to ERROR when FETCH_ERROR above threshold #202

FetcherBolt tracks cause of error in metadata

Add default config file in resources #193

FileSpout chokes on very large files #196

Use Maven-Shade everywhere #199

Ack tick tuples #194

Remove PrinterBolt and IndexerBolt, added StdOutStatusUpdater #187

Upgraded Tika to 1.11

This release contains many improvements to the Elasticsearch module :

Added README with a getting started section

IndexerBolt uses url as doc ID

ESSpout : maxSecSinceQueriedDate param to avoid deep paging

ElasticSearchSpout can random sort -> better diversity of URLs

ElasticSearchSpout implements de/activate, counter for time spent querying, configurable result size

Simple Kibana dashboards for metrics and status indices

Metadata as structured object. Implements #197

ES Spout - more metrics acked, failed, es queries and docs

ESSeedInjector topology

Index init script uses ttl for metrics

Upgraded ES version to 1.7.2

The SOLR module has also received some attention :

solr-metadata #210

Cleaning some documentation and typo issues

Remove outdated configuration options for solr module

We also improved the metrics by adding a PerSecondReducer (#209) which is used by the FetcherBolts to provide page and byte per second metrics. The metrics names and codes got also improved - notably the gauges for ESSpout and FetcherBolt.

These changes combined with the Kibana dashboard templates make it easy to monitor a crawl and get addition insights into its behaviour, as illustrated below.

Of course thanks to Storm's pluggable and versatile metrics mechanism, it is relatively easy to send metrics to other backends such as AWS Cloudwatch for instance.

Thanks to the various users and contributors who helped with this release.