DigitalPebble's Blog

Tuesday 28 November 2017

What's new in StormCrawler 1.7

Amazingly this is the 20th release of StormCrawler! Here are the main changes:

Dependency updates

crawler-commons 0.9 #513

Core

(bugfix) ParserBolts should use outlinks from parsefilters #498
LD_JSON parsefilter #501
okhttp : store request and response headers verbatim in metadata #506
(bugfix) okhttp protocol does not store headers in metadata #507
HTTP clients should handle http.accept.language and http.accept #499
Selenium protocol follows redirections #514
RemoteDriverProtocol needs multiple instances #505
SitemapParserBolt should force mime-type based on the clue #515

Elasticsearch

ES Spout : define filter query via config #502
Upgrade to ES 6.0 #517

We recommend that all users move to this version. If you wish to remain on an older version of Elasticsearch, you can simply keep your existing version of the stormcrawler elasticsearch module while upgrading stormcrawler core.

This version improves the processing of sitemaps, via #515 and the use of crawler-commons 0.9, where we fixed the SAX parsing and extended its coverage. We also added improvements to our okhttp-based protocol implementation. If your crawl is a wide one, with potentially any sort of content, then you should go for okhttp over the default httpclient one. See our comparison of protocol implementations on the wiki.
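Switching protocol implementation and setting the header options from #499 is a matter of configuration. Here is a minimal sketch of the relevant entries in crawler-conf.yaml. The okhttp class name is an assumption based on the package layout of the module, and the header values are purely illustrative, so please check them against the wiki and your own version:

```yaml
config:
  # Assumption: class name of the okhttp-based protocol, verify it for your version
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
  # Illustrative values for the settings handled since #499
  http.accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  http.accept.language: "en-GB,en;q=0.8"
```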

Finally, if you want to extract semantic data represented in ld-json then you'll love #501.

As usual, thanks to all contributors and users. Happy crawling!

Friday 8 September 2017

What's new in StormCrawler 1.6

Dependency updates

jsoup 1.10.3
crawler-commons 0.8

Core

Use ISO representation of time for modifiedtime in adaptivescheduler #496
Use ISO representation of time for discoveryDate and lastProcessedDate, #477
Improved Charset Detection #495
SitemapParserBolt: whether to use SAX or not is now configurable
SitemapParserBolt generates metrics for average processing time
HTTP protocol based on OKHTTP #484
Apache Http client can use HEAD method on a per URL basis #485
ContentFilter to leave trace of the pattern that matched #480
Metadata has a new public method for getting first non-empty value from a set of keys
Added ARTICLE to patterns for content filter

LangID

Can add more than one lang code based on configurable prob threshold. #481

WARC

Added rotation policy based on time and filesize

ES: added es.status.reset.fetchdate.after #478
Removed Grafana resources - can be downloaded from Grafana portal

Monday 29 May 2017

What's new in StormCrawler 1.5

StormCrawler 1.5 has just been released! It is an important milestone, with the move to Elasticsearch 5.x and the implementation of long-awaited features such as the Selenium-based protocol. The code has been improved in many ways and, despite the seemingly short list of changes below, this new release is a mammoth one!

The project, in general, is in very good health, with more and more organisations using it in production, and an increased visibility, reflected by the growing number of questions on StackOverflow.

Here are the main changes in 1.5.

CORE DEPENDENCIES UPGRADES

Apache Storm 1.1.0 (#450)

CORE MODULE

HTTP Protocol: implement cookie handling (#32)
java.util.zip.ZipException: Not in GZIP format thrown on redirs with httpclient (#455)
Selenium-based protocol implementation (#144) which I described in a separate blog post
Indicate whether RobotsRules come from cache or have been fetched (#460)
Memory issues when ByteArrayBuffer gets instantiated with a large value despite maxLength being set (#462)
FetcherBolt to dump URLs being fetched to log (#464)
Override sitemapsAutoDiscovery settings per URL (#469)

Knowing whether RobotsRules come from the cache gives us more insights into the behaviour of the crawlers as we can display the ratio of cache vs live (see illustration below)

as well as pages fetched vs robots fetched.

ELASTICSEARCH

Utility class to export URL and metadata from ES index to file (#444)
Fixed sampling with aggregation spout in ES5
Upgrade to Elasticsearch 5.3 (#221 and #451)
Optimise nextFetchDate to speed up queries to Elasticsearch (#429 and #452)
Delete gone pages from index (#253)
metrics - remove filtering (#281)

One of the main changes related to Elasticsearch is the removal of ElasticsearchSpout and the introduction of CollapsingSpout, which uses the brand new FieldCollapsing in Elasticsearch. We also fixed a concurrency issue in the StatusUpdaterBolt (9fefac8) and improved the efficiency of the spouts by getting them to process results in a separate thread (1b0fb42). Combined with the optimisation of nextFetchDate (see above) and the fix of the sampling in AggregationSpout, this means that the Elasticsearch module is more efficient than ever.

The move to Elasticsearch 5.x was not without difficulties but the result justifies the effort. I described in a separate post the common pitfalls of upgrading an existing topology to Elasticsearch 5.

Coming next?

As usual, it is hard to guess what the next release will be made of as the project is driven by its community.

Having said that, I'd expect the Selenium-based protocol to get improved as users start to use it. It is also likely that we'll move away from the Apache HttpClient library (#443). As mentioned in the previous release, we'll probably upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser.

In the meantime and as usual, thanks to all contributors and users and happy crawling!

Monday 15 May 2017

Avoid these common pitfalls when upgrading StormCrawler with Elasticsearch 5.x

The next (and probably imminent) release of StormCrawler will contain an update of Elasticsearch to version 5.3. This is definitely a good thing, as we want to keep up with the latest versions of Elasticsearch, but it comes with a few pitfalls when upgrading your existing application. Some of the changes are documented in the README but I will reiterate them here, just in case.

LOG4J dependencies

ES5 requires an upgrade in the logging dependencies of Apache Storm. You can update the dependencies in your existing Storm cluster by hand but since my patch is part of Storm 1.1.0, you should probably upgrade Storm altogether. StormCrawler 1.5 will depend on Storm 1.1.0 (but probably works with older versions as well).

Maven Shade Configuration

The pom file of your StormCrawler-based project needs modifying as well: you'll need to specify the Maven Shade configuration and include:

<manifestEntries>
 <Change></Change>
 <Build-Date></Build-Date>
</manifestEntries>

See https://github.com/elastic/elasticsearch/issues/21627; this wasn't an issue with the previous versions of Elasticsearch.

Update es-conf.yaml

In particular, the value of es.status.bucket.field used to be _routing, which is an automatically generated field. However, this field is not available to the spouts anymore. Instead, use the same value as es.status.routing.fieldname, e.g. metadata.hostname.
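As a sketch, the relevant entries in es-conf.yaml would then look something like this. The enclosing `config:` key is an assumption about the layout of your file, and metadata.hostname is simply the example value mentioned above:

```yaml
config:
  # field used by the spouts; it used to be "_routing"
  es.status.bucket.field: "metadata.hostname"
  # keep the two entries in sync
  es.status.routing.fieldname: "metadata.hostname"
```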

Mapping

ES5 should be able to read your existing indices. However, if you create a new set of indices from scratch, make sure you use the latest version of the script.

I hope this will help you upgrade successfully. I will cover the new functionalities and improvements coming with StormCrawler 1.5 when it is released.

Happy crawling

Thursday 27 April 2017

Crawl dynamic content with Selenium and StormCrawler

Many websites rely on AJAX to provide smooth and reactive web applications and/or single page websites. While this works fine for humans using modern browsers, it is often challenging for robots, as they can’t interpret the Javascript and usually rely on low-level HTTP protocol implementations to get the binary content. Even Google announced only as recently as October 2015 that their crawlers can handle dynamic content, and tests have shown that this is still far from perfect.

Support for dynamic content is something that many users have asked for in StormCrawler and I am pleased to announce that we have recently committed code for this. The next release of StormCrawler (1.5) will contain a Selenium WebDriver-based protocol implementation so let’s have a sneak preview of how to use it and what it can do for you.

Prerequisites

The instructions below are based on Linux commands. You will need to install Java 8 and Maven to compile StormCrawler as well as PhantomJS (2.1.1 or above), which we will connect to via WebDriver. You might want to install Apache Storm, even though this is not a strict requirement as we’ll see below.

Until StormCrawler 1.5 is released, you will need to get the master branch, either with Git or by downloading the code from https://github.com/DigitalPebble/storm-crawler. Once this is done, cd to storm-crawler and run `mvn clean install`. This should put the storm-crawler artefacts in your local Maven repository, ready to use for the next step. This won’t be needed once 1.5 is released and you will be able to get the artefacts straight from Maven Central.

Simple example

Let’s first build a StormCrawler project using the Maven archetype:

mvn archetype:generate -B -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5 -DgroupId=com.digitalpebble.crawl -DartifactId=selenium-tutorial -Dversion=1.0-SNAPSHOT -Dpackage=com.digitalpebble.crawl

This will give you a basic set of resources and configuration for StormCrawler. Go to the selenium-tutorial directory and build the uber jar with `mvn clean package`. We are now ready to go with a simple example.

Edit the file crawler.flux and set https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing as value for the constructorArgs in the spout config as shown below:
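A minimal sketch of the resulting spout section, assuming the archetype generated an in-memory spout with the id and class name shown here (your generated file may differ in these details, so only change the value of constructorArgs):

```yaml
spouts:
  - id: "spout"
    # Assumption: in-memory spout seeded through its constructor arguments
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - ["https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing"]
```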

If you look at the source of that page, you’ll see that it consists mostly of Javascript. Fine for our browsers, but how does StormCrawler fare on it? With Storm installed and accessible on the command line, let’s do

storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000

This will start the topology defined in the Flux file and let it run for one minute.

Note: the command above assumes that you have installed Storm. Alternatively, you can run the code directly with Maven like so:

mvn clean compile exec:java -Dexec.mainClass=org.apache.storm.flux.Flux -Dexec.args="--local crawler.flux --sleep 60000"

The console will display a lot of logs about the components being initialised but also the status of the URLs (e.g. FETCHED, DISCOVERED, etc...), the fields extracted from the documents fetched and various metrics. To remove the latter, you can comment out the section topology.metrics.consumer.register in crawler-conf.yaml.
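Commenting out the section would look roughly like this. The consumer class shown is an assumption about what the archetype generated, so match it against your own crawler-conf.yaml:

```yaml
# topology.metrics.consumer.register:
#   - class: "org.apache.storm.metric.LoggingMetricsConsumer"
#     parallelism.hint: 1
```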

Tip: if you are feeling adventurous, have a look at the other entries from the conf files e.g. remove domain=domain from indexer.md.mapping and see how that affects the output below.

Regardless of whether you ran the topology using Storm or Maven, you should see an output similar to this:

content
url https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing
domain dagbladet.no
description Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.
title Dagbladet Mat

https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing FETCHED Thu Apr 27 14:46:59 BST 2017

The first 5 lines were generated by the StdOutIndexer. As we can see, no text content was generated at all, the title is a generic one and no other fields could be extracted. Further down, a single line was generated by the StdOutStatusUpdater, indicating that the URL was successfully fetched. However, no outlinks were discovered at all (we would have seen lines with a DISCOVERED status).

Selenium to the rescue

Time to put our brand new protocol implementation to use. Edit the file crawler-conf.yaml and add

http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
selenium.addresses: "http://localhost:9515"

This tells StormCrawler to use the custom protocol implementations and connect to a WebDriver server on port 9515.

Open a different console and run `phantomjs --webdriver 9515`, then run the topology again and look at the output:

content 2873 chars
url https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing
keywords mat,oppskrift,kokker,råvarer,ingredienser,bakt,potet,med,rømme-,og,blåmuggostdressing
domain dagbladet.no
description Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.
title Bakt potet med rømme- og blåmuggostdressing - Oppskrift | Dagbladet Mat

This time we got some textual content, the correct title and were able to extract keywords. As you’ve certainly noticed, we got all sorts of outlinks, similar to what we can observe with a browser.

What happened under the bonnet is that PhantomJS gave us a fully interpreted HTML page, on which we ran our JSoup parser. The latter used the ParseFilters defined in src/main/resources/parsefilters.json to extract the metadata displayed by the indexer later on (i.e. title, description, domain, keywords, canonical).

Let’s now look at a slightly more complex scenario.

NavigationFilters

Websites often use Javascript for interactions within a page and navigation through the content. If we look at https://rn12.ultipro.com/SOU1022/JobBoard/ListJobs.aspx for instance, we can see that the pagination for the result lists is done in Javascript. Assuming that we want to extract all the jobs listed for that board, we would be able to get the links from the initial page with the simple HTTP protocol implementation but not the links to the following result pages as they are handled with AJAX.

Luckily, we can implement the navigation logic by implementing a class extending NavigationFilter. First, let’s create a new file JobBoardNavigationFilter.java in src/main/java/com/digitalpebble/crawl and fill it with the content below

Tip: wget "https://s.apache.org/mOkz" -O src/main/java/com/digitalpebble/crawl/JobBoardNavigationFilter.java

The approach used here is to generate dummy HTML content and create links for all the job pages, while iterating over the result pages. This class gets called by the Selenium-based protocol implementation.

Now, let’s create a new file navigationfilters.json in the directory resources and give it the following content

{
  "com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters": [
    {
      "class": "com.digitalpebble.crawl.JobBoardNavigationFilter",
      "name": "JobBoard"
    }
  ]
}

Finally, we specify the name of the file we just created in the config with

navigationfilters.config.file: navigationfilters.json

Don’t forget to recompile the code with `mvn clean package` before launching the crawl. This time we’ll just check that we get all the links to the job pages in one go.

storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000 | grep DISCOVERED

Note: why not download chromedriver and use it instead of PhantomJS? By default, chromedriver does not run in headless mode so you could see the browser being driven by the navigation filter, including the stuff you usually don’t notice, like the robots.txt file being fetched.

Conclusion

The resources covered here are the very first step towards making StormCrawler handle dynamic content and there is much work to do on improving it. However, the brand new protocol based on Selenium should already be a useful starting point. I hope you'll give it a try. Happy crawling!

Tuesday 4 April 2017

Video Tutorial - StormCrawler + Elasticsearch + Kibana

This tutorial explains how to configure Elasticsearch with StormCrawler.

We first bootstrap a StormCrawler project using the Maven archetype, have a look at the resources and code generated, then modify the project so that it uses Elasticsearch. We then run an injection topology and the crawl topology before setting up Kibana for monitoring the metrics and content of the status index.

(with my apologies for the quality of the sound)

Enjoy

Julien

Wednesday 29 March 2017

Full day workshop(s) on StormCrawler (+Elasticsearch and Kibana)

I will be running a full-day workshop on crawling with StormCrawler on the 24th April in Berlin. See full details on https://endoctus.com/course/web-crawling-with-stormcrawler.

Please find the program below:

In this workshop, we will explore StormCrawler, a collection of resources for building low-latency, large-scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we'll put it to use straight away for a simple crawl before moving on to the deployed mode of Storm.

In the second part of the session, we will then introduce metrics and index documents with Elasticsearch and Kibana and dive into data extraction. Finally, we'll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.

This course will suit Java developers with an interest in big data, stream processing, web crawling and search. It will provide a practical introduction to Apache Storm and Elasticsearch as well as, of course, StormCrawler, and should not require advanced programming skills.

Duration : 2x3 hours

PS: Do you follow DigitalPebble or StormCrawler on Twitter? Announcements and updates are made there (as well as all sorts of interesting news of course!)