DigitalPebble's Blog: big data

Showing posts with label big data. Show all posts

Thursday, 22 November 2018

What's new in StormCrawler 1.12

The previous release was only last month but I decided to ship this one now as it contains several bugfixes and improvements which many users would benefit from.

As you can see below, the main changes are around protocols and sitemaps. We have used Selenium and OKHTTP a lot recently to deal with dynamic websites and the changes below definitely help for these. There is also an important bugfix for JSOUP (#653) and various other improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades

JSOUP 1.11.3 (#663)
Elasticsearch 6.5.0 (#661)
Jackson and Wiremock dependencies (#640)

Core

Post JSON data with OKHTTP protocol via metadata (#641)
Selenium RemoteDriverProtocol triggered by K/V in metadata (#642)
SeleniumProtocol NavigationFilters not reached in case of a redirection (#643)
Limit crawl to URLs found in sitemaps (#645)
spout.reset.fetchdate.after based on time when query was set to NOW (#648)
Avoid StackOverflowError when generating DocumentFragment from JSOUP (#653)
redirected sitemaps don't have isSitemap=true (#660)
Staggered scheduling of sitemap URLs (#657)
Scheduling -> round to the closest second, minute or hour (#654)
FetcherBolt don't add discovered sitemaps if the robots rules do not allow them (#662)

WARC

WARC record format: trailing zero byte causes WARC parser to fail (#652)

Elasticsearch

ES IndexerBolt track number of batch sent (#540)
Rename index index into docs (#649)
ES StatusMetricsBolt generate metrics for total number of docs (#651)

Coming next...

The release of Storm 2.0.0 has taken longer than expected, which is partly my fault as I reported a number of issues. These issues have now been fixed and hopefully, 2.0.0 will be out soon. As mentioned last month, there's a branch of StormCrawler which works on the Storm 2.x branch. Give it a try if you want to be on the cutting edge!

Finally, there will be a StormCrawler workshop in Vilnius next week. I am sure tickets are still available if you fancy a last minute trip to Lithuania.

As usual, thanks to all contributors and users. Happy crawling!

UPDATE

There were 2 bugs in release 1.12 which have been fixed in 1.12.1, see details on

https://github.com/DigitalPebble/storm-crawler/milestone/23?closed=1

Thursday, 27 April 2017

Crawl dynamic content with Selenium and StormCrawler

Many websites rely on AJAX to provide smooth and reactive web applications and/or single page websites. While this works fine for humans using modern browsers, this is often challenging for robots as they can’t interpret the Javascript and usually rely on low-level HTTP protocol implementations to get the binary content. Even Google have announced only as recently as October 2015 that their crawlers can handle dynamic content, even though tests have shown that this is still far from being perfect.

Support for dynamic content is something that many users have asked for in StormCrawler and I am pleased to announce that we have recently committed code for this. The next release of StormCrawler (1.5) will contain a Selenium WebDriver-based protocol implementation so let’s have a sneak preview of how to use it and what it can do for you.

Prerequisites

The instructions below are based on Linux commands. You will need to install Java 8 and Maven to compile StormCrawler as well as PhantomJS (2.1.1 or above), which we will connect to via WebDriver. You might want to install Apache Storm, even though this is not a strict requirement as we’ll see below.

Until StormCrawler 1.5 is released, you will need to get the master branch, either with Git or by downloading the code from https://github.com/DigitalPebble/storm-crawler. Once this is done, cd to storm-crawler and run `mvn clean install`. This should put the storm-crawler artefacts in your local Maven repository, ready to use for the next step. This won’t be needed once 1.5 is released and you will be able to get the artefacts straight from Maven Central.

Simple example

Let’s first build a StormCrawler project using the Maven archetype:

mvn archetype:generate -B -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5 -DgroupId=com.digitalpebble.crawl -DartifactId=selenium-tutorial -Dversion=1.0-SNAPSHOT -Dpackage=com.digitalpebble.crawl

This will give you a basic set of resources and configuration for StormCrawler. Go to the selenium-tutorial directory and build the uber jar with `mvn clean package`. We are now ready to go with a simple example.

Edit the file crawler.flux and set https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing as value for the constructorArgs in the spout config as shown below:

If you look at the source of that page, you’ll see that it consists mostly of Javascript. Fine for our browsers, but how does StormCrawler fare on it? With Storm installed and accessible on the command line, let’s do

storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000

This will start the topology defined in the Flux file and let it run for one minute.

Note: the command above assumes that you have installed Storm. Alternatively, you can run the code directly with Maven like so:

mvn clean compile exec:java -Dexec.mainClass=org.apache.storm.flux.Flux -Dexec.args="--local crawler.flux --sleep 60000"

The console will display a lot of logs about the components being initialised but also the status of the URLs (e.g. FETCHED, DISCOVERED, etc...), the fields extracted from the documents fetched and various metrics. To remove the latter, you can comment out the section topology.metrics.consumer.register in crawler-conf.yaml.

Tip: if you are feeling adventurous, have a look at the other entries from the conf files e.g. remove domain=domain from indexer.md.mapping and see how that affects the output below.

Regardless of whether you ran the topology using Storm or Maven, you should see an output similar to this:

content

url https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing

domain dagbladet.no

description Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.

title Dagbladet Mat

https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing FETCHED Thu Apr 27 14:46:59 BST 2017

The first 5 lines were generated by the StdOutIndexer and as we can see, no text content was generated at all, the title is a generic one and no other fields could be extracted. Further down, a single line was generated by the StdOutStatusUpdater, indicating that the URL was successfully fetched, however, no outlinks were discovered at all (we would have seen lines with a DISCOVERED status).

Selenium to the rescue

Time to put our brand new protocol implementation to use. Edit the file crawler-conf.yaml and add

http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

selenium.addresses: "http://localhost:9515"

This tells StormCrawler to use the custom protocol implementations and connect to a WebDriver server on port 9515.

Open a different console and run `phantomjs --webdriver 9515` then run the topology again and look at the output

content 2873 chars

url https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing

keywords mat,oppskrift,kokker,råvarer,ingredienser,bakt,potet,med,rømme-,og,blåmuggostdressing

domain dagbladet.no

description Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.

title Bakt potet med rømme- og blåmuggostdressing - Oppskrift | Dagbladet Mat

This time we got some textual content, the correct title and were able to extract keywords. As you’ve certainly noticed, we got all sorts of outlinks, similar to what we can observe with a browser.

What happened under the bonnet is that PhantomJS gave us a fully interpreted HTML page, on which we ran our JSoup parser. The latter used the ParseFilters defined in src/main/resources/parsefilters.json to extract the metadata displayed by the indexer later on (i.e. title, description, domain, keywords, canonical).

Let’s now look at a slightly more complex scenario.

NavigationFilters

Websites often use Javascript for interactions within a page and navigation through the content. If we look at https://rn12.ultipro.com/SOU1022/JobBoard/ListJobs.aspx for instance, we can see that the pagination for the result lists is done in Javascript. Assuming that we want to extract all the jobs listed for that board, we would be able to get the links from the initial page with the simple HTTP protocol implementation but not the links to the following result pages as they are handled with AJAX.

Luckily, we can implement the navigation logic by implementing a class extending NavigationFilter. First, let’s create a new file JobBoardNavigationFilter.java in src/main/java/com/digitalpebble/crawl and fill it with the content below

Tip: wget "https://s.apache.org/mOkz" -O src/main/java/com/digitalpebble/crawl/JobBoardNavigationFilter.java

The approach used here it to generate a dummy HTML content and create links for all the job pages, while iterating on the result pages. This class gets called by the Selenium-based protocol implementation.

Now, let’s create a new file navigationfilters.json in the directory resources and give it the following content

{

"com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters": [

{

"class": "com.digitalpebble.crawl.JobBoardNavigationFilter",

"name": "JobBoard"

}

]

}

Finally, we specify the name of the file we just created in the config with

navigationfilters.config.file: navigationfilters.json

Don’t forget to recompile the code with `mvn clean package` before launching the crawl. This time we’ll just check that we get all the links to the job pages in one go.

storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000 | grep DISCOVERED

Note: why not download chromedriver and use it instead of PhantomJS? By default, chromedriver does not run in headless mode so you could see the browser being driven by the navigation filter, including the stuff you usually don’t notice, like the robots.txt file being fetched.

Conclusion

The resources covered here are the very first step towards making StormCrawler handle dynamic content and there is much work to do on improving it, however, the brand new protocol based on Selenium should already be a useful starting point. I hope you'll give it a try, happy crawling!

Wednesday, 29 March 2017

Need billions of web pages? Don't bother crawling...

How big did you say?

I am often contacted by prospective clients to help them crawl the web on a very large scale or find questions such as this one on StackOverflow. What people want to achieve with web data varies greatly from one case to the next: some need to extract specific data from as many pages as possible, some want to build search engines, while others wish to test the accuracy of a machine learning model on real data.

Luckily, there are resources available for large scale web crawling, both on the platform side (e.g. Amazon Web Services) and the software side (StormCrawler, Apache Nutch), however large scale crawling (think billions of pages and hundreds of servers) is costly, complex and time-consuming. At DigitalPebble, we help our clients with such tasks but what I often tend to recommend as an initial step is to have a look at CommonCrawl.

CommonCrawl to the rescue

CommonCrawl is a non-profit organisation which provides web crawl data for free. Their datasets are used by various organisations, both in academia and industry, as can be seen on the examples page. The applications range from machine learning to natural language processing or computational linguistics. For instance, at DigitalPebble, we have used the CommonCrawl dataset for some of our clients for information extraction (phone numbers and contact details publicly available), machine learning (to check the accuracy of a classifier on real, big, messy data) as well as lexicometry (get frequencies of anchor tags). I should also mention that CommonCrawl themselves are clients of ours: we developed Apache Nutch resources for them and also ran their February 2016 web crawl. We also contributed to the set up of their news crawl (see below).

CommonCrawl provides two types of datasets, both hosted on Amazon S3 as part of the Amazon Public Datasets program.

Web crawl

The main dataset is released on a monthly basis and consists of billions of web pages stored in WARC format on AWS S3. The latest release had 3.08 billion web pages and about 250 TiB of uncompressed content: that’s a lot of data to play with, and it comes for free!

These pages are mainly HTML documents, but there are also a few PDF and images. Until recently, the coverage was very US-centric and the datasets contained mostly the same URLs from one release to the next, but this is no longer the case as European domain names and the top 1 million Alexa domains are crawled (see details on http://commoncrawl.org/2017/03/february-2017-crawl-archive-now-available/). Interestingly, CommonCrawl use Apache Nutch to generate their datasets, albeit with a few home-made modifications.

Basically, each release is split into 100 segments. Each segment has three types of files WARC, WAT and WET. As explained on the Get Started page:

WARC files store the raw crawl data
WAT files store computed metadata for the data stored in the WARC
WET files store extracted plaintext from the data stored in the WARC

Note that WAT and WET are in the WARC format too! In fact, the WARC format is nothing more than an envelope with metadata and content. In the case of the WARC files, that content is the HTTP requests and responses, whereas for the WET files, it is simply the plain text extracted from the WARCs. The WAT files contain a JSON representation of metadata extracted from the WARCs e.g. title, links etc…

So, not only have CommonCrawl given you loads of web data for free, they’ve also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you won’t have to process the WARC files.

This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts.

News dataset

Unlike the main web crawl, the news dataset is released continuously. As its name suggests, it consists exclusively of news pages and articles as described on http://commoncrawl.org/2016/10/news-dataset-available/. There are between 3 and 5 WARC files (1GB each) generated daily, corresponding to 300 to 400 thousand pages. In total, over 25 million news pages have been crawled to date. The dataset contains WARC files only so you will have to write some code to extract the text and metadata yourself.

The news dataset is generated using our very own StormCrawler and the code of the news crawl is publicly available on CommonCrawl’s GitHub account.

Resources

The Get Started page on the CommonCrawl website contains useful pointers to libraries and code in various programming languages to process the datasets. There is also a list of tutorials and presentations.

It is also worth noting that CommonCrawl provides an index per release, allowing you to search for URLs (including wildcards) and retrieve the segment and offset therein where the content of the URL is stored e.g.

{ "urlkey": "org,apache)/", "timestamp": "20170220105827", "status": "200", "url": "http://apache.org/", "filename": "crawl-data/CC-MAIN-2017-09/segments/1487501170521.30/warc/CC-MAIN-20170219104610-00206-ip-10-171-10-108.ec2.internal.warc.gz", "length": "13315", "mime": "text/html", "offset": "14131184", "digest": "KJREISJSKKGH6UX5FXGW46KROTC6MBEM" }

This is useful but only if you are interested in a limited number of URLs which you know in advance. In many cases, what you know in advance is what you want to extract, not where it will be extracted from. For situations such as these, you will need distributed batch-processing using MapReduce in Apache Hadoop or Apache Spark.

As hinted above, I tend to use AWS EMR (ElasticMapReduce). Running the code in AWS makes sense as the data sets are stored on S3 so access is fast and there is no transfer cost, also the EC2 instances will have the credentials pre-set so there is no additional configuration needed to access the data. There is an additional cost in using EMR but this saves me from having to configure Hadoop. In addition, I usually store the output of the reduce steps on a S3 bucket so that nothing is kept on HDFS and I can use spot instances to keep the cost down. If they get terminated, nothing is lost. Of course, other platforms (Azure, Google) or alternatives to EMR (Hortonworks HDP) can be used instead.

Finally, I implement the logic with MapReduce in Java thanks to libraries such as warc-hadoop which deals with the low-level access to WARC files. If you need to process CommonCrawl with existing frameworks and libraries such as Apache UIMA, Tika or GATE, our good old open source project Behemoth could help as it can ingest WARCs too!

Conclusion

As we’ve seen, CommonCrawl is an awesome resource and should be the first thing you try before embarking on web scale crawling (although if you must, DigitalPebble would be happy to help). It is large, it is free, it is relatively easy to process and a lot of effort has been put into making your life easier.

Web data are big, messy and often don’t give the results you expect. Processing the CommonCrawl dataset is a great way of checking your assumptions at a fraction of the cost of a web scale crawl. It also saves you time, as the fetch politeness has been done for you but on the minus side, you will be able to process only content allowed by robots.txt directives as CommonCrawl’s crawler is polite (but then yours should be too).

I hope you will give CommonCrawl a try and if you find it useful, you can donate to the project.