Wednesday 29 March 2017

Full day workshop(s) on StormCrawler (+Elasticsearch and Kibana)


I will be running a full-day workshop on crawling with StormCrawler on the 24th April in Berlin. See full details on https://endoctus.com/course/web-crawling-with-stormcrawler.

Please find the program below:

In this workshop, we will explore StormCrawler a collection of resources for building low-latency, large scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what Storm-Crawler provides, we'll put it to use straight away for a simple crawl before moving on to the deployed mode of Storm

In the second part of the session, we will then introduce metrics and index documents with Elasticsearch and Kibana and dive into data extraction. Finally, we'll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.  

This course will suit Java developers with an interest in big data, stream processing, web crawling and search. It will provide a practical introduction to both Apache Storm and Elasticsearch as well of course as StormCrawler and should not require advanced programming skills. 

Duration : 2x3 hours 


PS: Do you follow DigitalPebble or StormCrawler on Twitter? Announcements and updates are made there (as well as all sorts of interesting news of course!) 

Need billions of web pages? Don't bother crawling...

How big did you say?

I am often contacted by prospective clients to help them crawl the web on a very large scale or find questions such as this one on StackOverflow. What people want to achieve with web data varies greatly from one case to the next: some need to extract specific data from as many pages as possible, some want to build search engines, while others wish to test the accuracy of a machine learning model on real data.  

Luckily, there are resources available for large scale web crawling, both on the platform side (e.g. Amazon Web Services) and the software side (StormCrawler, Apache Nutch), however large scale crawling (think billions of pages and hundreds of servers) is costly, complex and time-consuming.  At DigitalPebble, we help our clients with such tasks but what I often tend to recommend as an initial step is to have a look at CommonCrawl.

CommonCrawl to the rescue

CommonCrawl is a non-profit organisation which provides web crawl data for free. Their datasets are used by various organisations, both in academia and industry, as can be seen on the examples page. The applications range from machine learning to natural language processing or computational linguistics. For instance, at DigitalPebble, we have used the CommonCrawl dataset for some of our clients for information extraction (phone numbers and contact details publicly available), machine learning (to check the accuracy of a classifier on real, big, messy data) as well as lexicometry (get frequencies of anchor tags). I should also mention that CommonCrawl themselves are clients of ours: we developed Apache Nutch resources for them and also ran their February 2016 web crawl. We also contributed to the set up of their news crawl (see below).

CommonCrawl provides two types of datasets, both hosted on Amazon S3 as part of the Amazon Public Datasets program.

Web crawl


The main dataset is released on a monthly basis and consists of billions of web pages stored in WARC format on AWS S3. The latest release had 3.08 billion web pages and about 250 TiB of uncompressed content: that’s a lot of data to play with, and it comes for free!

These pages are mainly HTML documents, but there are also a few PDF and images. Until recently, the coverage was very US-centric and the datasets contained mostly the same URLs from one release to the next, but this is no longer the case as European domain names and the top 1 million Alexa domains are crawled (see details on http://commoncrawl.org/2017/03/february-2017-crawl-archive-now-available/). Interestingly, CommonCrawl use Apache Nutch to generate their datasets, albeit with a few home-made modifications.

Basically, each release is split into 100 segments. Each segment has three types of files WARC, WAT and WET. As explained on the Get Started page:

  • WARC files store the raw crawl data
  • WAT files store computed metadata for the data stored in the WARC
  • WET files store extracted plaintext from the data stored in the WARC

Note that WAT and WET are in the WARC format too! In fact, the WARC format is nothing more than an envelope with metadata and content. In the case of the WARC files, that content is the HTTP requests and responses, whereas for the WET files, it is simply the plain text extracted from the WARCs. The WAT files contain a JSON representation of metadata extracted from the WARCs e.g. title, links etc…

So, not only have CommonCrawl given you loads of web data for free, they’ve also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you won’t have to process the WARC files.

This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts.

News dataset


Unlike the main web crawl, the news dataset is released continuously. As its name suggests, it consists exclusively of news pages and articles as described on http://commoncrawl.org/2016/10/news-dataset-available/. There are between 3 and 5 WARC files (1GB each) generated daily, corresponding to 300 to 400 thousand pages. In total, over 25 million news pages have been crawled to date. The dataset contains WARC files only so you will have to write some code to extract the text and metadata yourself.

The news dataset is generated using our very own StormCrawler and the code of the news crawl is publicly available on CommonCrawl’s GitHub account.

Resources

The Get Started page on the CommonCrawl website contains useful pointers to libraries and code in various programming languages to process the datasets. There is also a list of tutorials and presentations.

It is also worth noting that CommonCrawl provides an index per release, allowing you to search for URLs (including wildcards) and retrieve the segment and offset therein where the content of the URL is stored e.g.


{ "urlkey": "org,apache)/", "timestamp": "20170220105827", "status": "200", "url": "http://apache.org/", "filename": "crawl-data/CC-MAIN-2017-09/segments/1487501170521.30/warc/CC-MAIN-20170219104610-00206-ip-10-171-10-108.ec2.internal.warc.gz", "length": "13315", "mime": "text/html", "offset": "14131184", "digest": "KJREISJSKKGH6UX5FXGW46KROTC6MBEM" }

This is useful but only if you are interested in a limited number of URLs which you know in advance. In many cases, what you know in advance is what you want to extract, not where it will be extracted from. For situations such as these, you will need distributed batch-processing using MapReduce in Apache Hadoop or Apache Spark.

As hinted above, I tend to use AWS EMR (ElasticMapReduce). Running the code in AWS makes sense as the data sets are stored on S3 so access is fast and there is no transfer cost, also the EC2 instances will have the credentials pre-set so there is no additional configuration needed to access the data. There is an additional cost in using EMR but this saves me from having to configure Hadoop. In addition, I usually store the output of the reduce steps on a S3 bucket so that nothing is kept on HDFS and I can use spot instances to keep the cost down. If they get terminated, nothing is lost. Of course, other platforms (Azure, Google) or alternatives to EMR (Hortonworks HDP) can be used instead.

Finally, I implement the logic with MapReduce in Java thanks to libraries such as warc-hadoop which deals with the low-level access to WARC files. If you need to process CommonCrawl with existing frameworks and libraries such as Apache UIMA, Tika or GATE, our good old open source project Behemoth could help as it can ingest WARCs too!

Conclusion

As we’ve seen, CommonCrawl is an awesome resource and should be the first thing you try before embarking on web scale crawling (although if you must, DigitalPebble would be happy to help). It is large, it is free, it is relatively easy to process and a lot of effort has been put into making your life easier.

Web data are big, messy and often don’t give the results you expect. Processing the CommonCrawl dataset is a great way of checking your assumptions at a fraction of the cost of a web scale crawl. It also saves you time, as the fetch politeness has been done for you but on the minus side, you will be able to process only content allowed by robots.txt directives as CommonCrawl’s crawler is polite (but then yours should be too).

I hope you will give CommonCrawl a try and if you find it useful, you can donate to the project.

Thursday 23 March 2017

What’s new in StormCrawler 1.4

StormCrawler 1.4 has just been released! As usual, all users are advised to upgrade to this version as it fixes some bugs and contains quite a few new functionalities.

Core dependencies upgrades

  • Httpclient 4.5.3
  • Storm 1.0.3 #437

Core module

  • JSoupParser does not dedup outlinks properly, #375
  • Custom schedule based on metadata for non-success pages, #386
  • Adaptive fetch scheduler #407
  • Sitemap: increased default offset for guessing + made it configurable  #409
  • Added URLFilterBolt + use it in ESSeedInjector #421
  • URLStreamGrouping 425
  • Better handling of redirections for HTTP robots #4372d16
  • HTTP Proxy over Basic Authentication #432
  • Improved metrics for status updater cache (hits and misses) #434
  • File protocol implementation #436
  • Added CollectionMetrics (used in ES MetricsConsumer + ES Spout, see below) #7d35acb

AWS

  • Added code for caching and retrieving content from AWS S3 #e16b66ef

SOLR

  • Basic upgrade to Solr 6.4.1
  • Use ConcurrentUpdateSolrClient; #183

Elasticsearch

  • Various changes to StatusUpdaterBolt
    Fixed bugs introduced in 1.3 (use of SHA ID), synchronisation issues, better logging, optimisation of docs sent and more robust handling of tuples waiting to be acked (#426). The most important change is a bug fix whereby the cache was never hit (#442) which had a large impact on performance.
  • Simplified README + removed bigjar profile from pom #414
  • Provide basic mapping for doc index #433
  • Simple Grafana dashboard for SC metrics, #380
  • Generate metrics about status counts, #389
  • Spouts report time taken by queries using CollectionMetric, #439 - as illustrated below
Spout query times displayed by Grafana
(illustrating the impact of SamplerAggregationSpout on a large status index )

Coming next?

As usual, it is not clear what the next release will contain but hopefully, we'll switch to Elasticsearch 5 (you can already take it from the branch es5.3) and provide resources for Selenium (see branch jBrowserDriver). As I pointed out in my previous post, getting early feedback on work in progress is a great way of contributing to the project.

We'll probably also upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser. We might move to one of the next releases of Apache Storm, where a recent contribution I made will make it possible to use Elasticsearch 5. Also, some of our StormCrawler code has been donated to Storm, which is great!

In the meantime and as usual, thanks to all contributors and users and happy crawling!

PS: I will be running a workshop in Berlin next month about StormCrawler, Storm in general and Elasticsearch


Friday 17 March 2017

Contribute to an open source project beyond code

I was recently contacted by someone who liked StormCrawler, wanted to contribute to it and asked me how to do so. While most contributions to open source projects take the form of code to either fix bugs or add new functionalities, there are various other ways in which people can contribute.

Here is what I replied to him, and while the examples below are about StormCrawler, the same ideas can apply to pretty much any open source project.

  • Spread the word: if you use StormCrawler, why not blog/tweet about it and get listed on the powered by page? The more people see that it is used, the more confident they become in adopting it. If you are not too shy: why not give a short presentation at a local tech meetup or a bigger conference?
  • Help with the documentation or write tutorials: we have WIKI pages and various instructions on the site - going through those would be a good way of learning about Apache Storm and StormCrawler while at the same time make a useful contribution.
  • Find bugs and possible improvements: run the code, benchmark it, look at the logs for unexpected things. Just play and see! If something is not clear, then the docs can be improved (see previous point).
  • Test things in branches / PRs: for instance, I started work on jBrowserDriver and Elasticsearch 5. Giving new functionalities an early try is fab.
  • Help others:  you have used StormCrawler a bit? Join the mailing list or follow StackOverflow and help newcomers overcome the hurdles as you did.
  • Donate resources: your company has one or more servers they are not using (unlikely but who knows)? You have AWS credits and don't know what to do with them? We can always do with test machines.
Any of those forms of contributions is valuable! Writing code is good but that's just one part of making a project successful.

PS: if you are wondering what happened with that prospective contributor, he's taken one of the open issues and doing great work on it!