Thursday, 27 April 2017

Crawl dynamic content with Selenium and StormCrawler

Many websites rely on AJAX to provide smooth and reactive web applications and/or single page websites. While this works fine for humans using modern browsers, this is often challenging for robots as they can’t interpret the Javascript and usually rely on low-level HTTP protocol implementations to get the binary content. Even Google have announced only as recently as October 2015 that their crawlers can handle dynamic content, even though tests have shown that this is still far from being perfect.

Support for dynamic content is something that many users have asked for in StormCrawler and I am pleased to announce that we have recently committed code for this. The next release of StormCrawler (1.5) will contain a Selenium WebDriver-based protocol implementation so let’s have a sneak preview of how to use it and what it can do for you.

Prerequisites

The instructions below are based on Linux commands. You will need to install Java 8 and Maven to compile StormCrawler as well as PhantomJS (2.1.1 or above), which we will connect to via WebDriver. You might want to install Apache Storm, even though this is not a strict requirement as we’ll see below.
Until StormCrawler 1.5 is released, you will need to get the master branch, either with Git or by downloading the code from https://github.com/DigitalPebble/storm-crawler. Once this is done, cd to storm-crawler and run `mvn clean install`. This should put the storm-crawler artefacts in your local Maven repository, ready to use for the next step. This won’t be needed once 1.5 is released and you will be able to get the artefacts straight from Maven Central.

Simple example

Let’s first build a StormCrawler project using the Maven archetype:

mvn archetype:generate -B -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5 -DgroupId=com.digitalpebble.crawl -DartifactId=selenium-tutorial -Dversion=1.0-SNAPSHOT -Dpackage=com.digitalpebble.crawl


This will give you a basic set of resources and configuration for StormCrawler.  Go to the selenium-tutorial directory and build the uber jar with `mvn clean package`. We are now ready to go with a simple example.

Edit the file crawler.flux and set https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing as value for the constructorArgs in the spout config as shown below:



If you look at the source of that page, you’ll see that it consists mostly of Javascript. Fine for our browsers, but how does StormCrawler fare on it? With Storm installed and accessible on the command line, let’s do


storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000


This will start the topology defined in the Flux file and let it run for one minute.


Note: the command above assumes that you have installed Storm. Alternatively, you can run the code directly with Maven like so:
mvn clean compile exec:java -Dexec.mainClass=org.apache.storm.flux.Flux -Dexec.args="--local crawler.flux --sleep 60000"


The console will display a lot of logs about the components being initialised but also the status of the URLs (e.g. FETCHED, DISCOVERED, etc...), the fields extracted from the documents fetched and various metrics. To remove the latter, you can comment out the section topology.metrics.consumer.register in crawler-conf.yaml.


Tip: if you are feeling adventurous, have a look at the other entries from the conf files e.g. remove domain=domain from indexer.md.mapping and see how that affects the output below.


Regardless of whether you ran the topology using Storm or Maven, you should see an output similar to this:


content
url https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing
domain dagbladet.no
description Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.
title Dagbladet Mat


https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing FETCHED Thu Apr 27 14:46:59 BST 2017


The first 5 lines were generated by the StdOutIndexer and as we can see, no text content was generated at all, the title is a generic one and no other fields could be extracted. Further down, a single line was generated by the StdOutStatusUpdater, indicating that the URL was successfully fetched, however, no outlinks were discovered at all (we would have seen lines with a DISCOVERED status).


Selenium to the rescue


Time to put our brand new protocol implementation to use. Edit the file crawler-conf.yaml and add


 http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
 https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
 selenium.addresses: "http://localhost:9515"


This tells StormCrawler to use the custom protocol implementations and connect to a WebDriver server on port 9515.


Open a different console and run `phantomjs --webdriver 9515` then run the topology again and look at the output


content 2873 chars
url https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing
keywords mat,oppskrift,kokker,råvarer,ingredienser,bakt,potet,med,rømme-,og,blåmuggostdressing
domain dagbladet.no
description Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.
title Bakt potet med rømme- og blåmuggostdressing - Oppskrift | Dagbladet Mat


This time we got some textual content, the correct title and were able to extract keywords. As you’ve certainly noticed, we got all sorts of outlinks, similar to what we can observe with a browser.


What happened under the bonnet is that PhantomJS gave us a fully interpreted HTML page, on which we ran our JSoup parser. The latter used the ParseFilters defined in src/main/resources/parsefilters.json to extract the metadata displayed by the indexer later on (i.e. title, description, domain, keywords, canonical).


Let’s now look at a slightly more complex scenario.


NavigationFilters


Websites often use Javascript for interactions within a page and navigation through the content. If we look at https://rn12.ultipro.com/SOU1022/JobBoard/ListJobs.aspx for instance, we can see that the pagination for the result lists is done in Javascript. Assuming that we want to extract all the jobs listed for that board, we would be able to get the links from the initial page with the simple HTTP protocol implementation but not the links to the following result pages as they are handled with AJAX.

Luckily, we can implement the navigation logic by implementing a class extending NavigationFilter. First, let’s create a new file JobBoardNavigationFilter.java in src/main/java/com/digitalpebble/crawl and fill it with the content below



Tip: wget "https://s.apache.org/mOkz" -O src/main/java/com/digitalpebble/crawl/JobBoardNavigationFilter.java


The approach used here it to generate a dummy HTML content and create links for all the job pages, while iterating on the result pages. This class gets called by the Selenium-based protocol implementation.


Now, let’s create a new file navigationfilters.json in the directory resources and give it the following content
{
 "com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters": [
   {
     "class": "com.digitalpebble.crawl.JobBoardNavigationFilter",
     "name": "JobBoard"
   }
 ]
}


Finally, we specify the name of the file we just created in the config with

navigationfilters.config.file: navigationfilters.json

Don’t forget to recompile the code with `mvn clean package` before launching the crawl. This time we’ll just check that we get all the links to the job pages in one go.

storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000 | grep DISCOVERED


Note: why not download chromedriver and use it instead of PhantomJS? By default, chromedriver does not run in headless mode so you could see the browser being driven by the navigation filter, including the stuff you usually don’t notice, like the robots.txt file being fetched.


Conclusion


The resources covered here are the very first step towards making StormCrawler handle dynamic content and there is much work to do on improving it, however, the brand new protocol based on Selenium should already be a useful starting point. I hope you'll give it a try, happy crawling!



No comments:

Post a Comment

Note: only a member of this blog may post a comment.