
Wednesday 29 March 2017

Need billions of web pages? Don't bother crawling...

How big did you say?

I am often contacted by prospective clients who need to crawl the web on a very large scale, and I regularly come across questions such as this one on StackOverflow. What people want to achieve with web data varies greatly from one case to the next: some need to extract specific data from as many pages as possible, some want to build search engines, while others wish to test the accuracy of a machine learning model on real data.

Luckily, there are resources available for large scale web crawling, both on the platform side (e.g. Amazon Web Services) and the software side (StormCrawler, Apache Nutch). However, large scale crawling (think billions of pages and hundreds of servers) is costly, complex and time-consuming. At DigitalPebble we help our clients with such tasks, but what I often recommend as an initial step is to have a look at CommonCrawl.

CommonCrawl to the rescue

CommonCrawl is a non-profit organisation which provides web crawl data for free. Their datasets are used by various organisations, both in academia and industry, as can be seen on the examples page. The applications range from machine learning to natural language processing and computational linguistics. For instance, at DigitalPebble we have used the CommonCrawl dataset for some of our clients for information extraction (phone numbers and contact details publicly available), machine learning (to check the accuracy of a classifier on real, big, messy data) as well as lexicometry (getting frequencies of anchor tags). I should also mention that CommonCrawl themselves are clients of ours: we developed Apache Nutch resources for them and also ran their February 2016 web crawl. We also contributed to the setup of their news crawl (see below).

CommonCrawl provides two types of datasets, both hosted on Amazon S3 as part of the Amazon Public Datasets program.

Web crawl


The main dataset is released on a monthly basis and consists of billions of web pages stored in WARC format on AWS S3. The latest release had 3.08 billion web pages and about 250 TiB of uncompressed content: that’s a lot of data to play with, and it comes for free!

These pages are mainly HTML documents, but there are also a few PDFs and images. Until recently, the coverage was very US-centric and the datasets contained mostly the same URLs from one release to the next, but this is no longer the case as European domain names and the top 1 million Alexa domains are now crawled (see details on http://commoncrawl.org/2017/03/february-2017-crawl-archive-now-available/). Interestingly, CommonCrawl use Apache Nutch to generate their datasets, albeit with a few home-made modifications.

Basically, each release is split into 100 segments. Each segment has three types of files: WARC, WAT and WET. As explained on the Get Started page:

  • WARC files store the raw crawl data
  • WAT files store computed metadata for the data stored in the WARC
  • WET files store extracted plaintext from the data stored in the WARC

Note that WAT and WET are in the WARC format too! In fact, the WARC format is nothing more than an envelope with metadata and content. In the case of the WARC files, that content is the HTTP requests and responses, whereas for the WET files, it is simply the plain text extracted from the WARCs. The WAT files contain a JSON representation of metadata extracted from the WARCs e.g. title, links etc…
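To make this more concrete, here is a minimal sketch (plain Java, no third-party libraries) of what reading a WARC file boils down to: each record starts with a block of text headers terminated by a blank line, followed by Content-Length bytes of payload. This is only an illustration; for real work you would use one of the libraries listed on the Get Started page. It assumes your JVM's GZIPInputStream copes with the per-record gzip members, which it normally does.

import java.io.*;
import java.util.zip.GZIPInputStream;

public class WarcHeaderDump {

    // Read a single header line (WARC headers are ASCII, CRLF-terminated)
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') buf.write(b);
        }
        return (b == -1 && buf.size() == 0) ? null : buf.toString("ISO-8859-1");
    }

    // InputStream.skip() may skip fewer bytes than asked, so loop until done
    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() == -1) break;
                skipped = 1;
            }
            n -= skipped;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: a local copy of a *.warc.gz file (path is hypothetical)
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(new FileInputStream(args[0])))) {
            String line;
            while ((line = readLine(in)) != null) {
                if (!line.startsWith("WARC/")) continue; // start of a record
                long length = 0;
                String uri = "";
                // read header lines until the blank line separating headers from payload
                while ((line = readLine(in)) != null && !line.isEmpty()) {
                    if (line.startsWith("Content-Length:"))
                        length = Long.parseLong(line.substring(15).trim());
                    else if (line.startsWith("WARC-Target-URI:"))
                        uri = line.substring(16).trim();
                }
                System.out.println(uri + "\t" + length + " bytes");
                skipFully(in, length); // skip the payload (HTTP headers + body for response records)
            }
        }
    }
}

The same loop works for WAT and WET files, since they use the same envelope; only the payload changes.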

So, not only have CommonCrawl given you loads of web data for free, they’ve also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you won’t have to process the WARC files.

This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts.
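For the record, the simplified version boiled down to walking the JSON of each WAT record instead of parsing HTML. A rough sketch of that traversal using Jackson is below; the JSON field names are my recollection of the usual WAT layout and are an assumption to be checked against a sample record before relying on them.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class WatAnchorText {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Extracts anchor text from the JSON payload of a single WAT record.
     *  NOTE: the field names below are assumptions based on the typical WAT layout;
     *  verify them against a real record. */
    public static void printAnchors(String watJson) throws Exception {
        JsonNode links = MAPPER.readTree(watJson)
                .path("Envelope")
                .path("Payload-Metadata")
                .path("HTTP-Response-Metadata")
                .path("HTML-Metadata")
                .path("Links");
        for (JsonNode link : links) {
            String text = link.path("text").asText("");
            if (!text.isEmpty()) {
                System.out.println(text + "\t" + link.path("url").asText(""));
            }
        }
    }
}

Reading the WAT records themselves can reuse the same envelope parsing shown earlier, since WAT files are WARC files whose payload happens to be JSON.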

News dataset


Unlike the main web crawl, the news dataset is released continuously. As its name suggests, it consists exclusively of news pages and articles as described on http://commoncrawl.org/2016/10/news-dataset-available/. There are between 3 and 5 WARC files (1GB each) generated daily, corresponding to 300 to 400 thousand pages. In total, over 25 million news pages have been crawled to date. The dataset contains WARC files only so you will have to write some code to extract the text and metadata yourself.
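Writing that code does not have to be daunting: once you have the payload of a WARC response record, an HTML parser such as JSoup gets you a long way. Here is a minimal sketch which simply assumes you already have the HTML as a string, e.g. obtained with whatever WARC reader you use.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class NewsTextExtractor {

    /** Small sketch: extract the title and plain text from the HTML payload
     *  of a WARC response record (the HTML string is assumed to have been
     *  read and decoded upstream). */
    public static String[] titleAndText(String html, String url) {
        Document doc = Jsoup.parse(html, url);
        return new String[] { doc.title(),
                doc.body() != null ? doc.body().text() : "" };
    }

    public static void main(String[] args) {
        String[] result = titleAndText(
                "<html><head><title>Hello</title></head><body><p>Some news article.</p></body></html>",
                "http://example.com/");
        System.out.println(result[0] + " | " + result[1]);
    }
}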

The news dataset is generated using our very own StormCrawler and the code of the news crawl is publicly available on CommonCrawl’s GitHub account.

Resources

The Get Started page on the CommonCrawl website contains useful pointers to libraries and code in various programming languages to process the datasets. There is also a list of tutorials and presentations.

It is also worth noting that CommonCrawl provides an index per release, allowing you to search for URLs (including wildcards) and retrieve the segment file and offset at which the content of a URL is stored, e.g.


{ "urlkey": "org,apache)/", "timestamp": "20170220105827", "status": "200", "url": "http://apache.org/", "filename": "crawl-data/CC-MAIN-2017-09/segments/1487501170521.30/warc/CC-MAIN-20170219104610-00206-ip-10-171-10-108.ec2.internal.warc.gz", "length": "13315", "mime": "text/html", "offset": "14131184", "digest": "KJREISJSKKGH6UX5FXGW46KROTC6MBEM" }

The index is useful, but only if you are interested in a limited number of URLs which you know in advance. In many cases, what you know in advance is what you want to extract, not where it will be extracted from. For situations such as these, you will need distributed batch processing with MapReduce in Apache Hadoop or with Apache Spark.

As hinted above, I tend to use AWS EMR (ElasticMapReduce). Running the code in AWS makes sense as the datasets are stored on S3, so access is fast and there is no transfer cost; the EC2 instances also have the credentials pre-set, so no additional configuration is needed to access the data. There is an additional cost in using EMR, but it saves me from having to configure Hadoop. In addition, I usually store the output of the reduce steps in an S3 bucket so that nothing is kept on HDFS and I can use spot instances to keep the cost down: if they get terminated, nothing is lost. Of course, other platforms (Azure, Google) or alternatives to EMR (Hortonworks HDP) can be used instead.

Finally, I implement the logic as MapReduce in Java, thanks to libraries such as warc-hadoop which deal with the low-level access to WARC files. If you need to process CommonCrawl with existing frameworks and libraries such as Apache UIMA, Tika or GATE, our good old open source project Behemoth could help, as it can ingest WARCs too!

Conclusion

As we’ve seen, CommonCrawl is an awesome resource and should be the first thing you try before embarking on web scale crawling (although if you must, DigitalPebble would be happy to help). It is large, it is free, it is relatively easy to process and a lot of effort has been put into making your life easier.

Web data are big, messy and often don’t give the results you expect. Processing the CommonCrawl dataset is a great way of checking your assumptions at a fraction of the cost of a web scale crawl. It also saves you time, as the fetch politeness has been handled for you; on the minus side, you will only be able to process content allowed by robots.txt directives, as CommonCrawl’s crawler is polite (but then yours should be too).

I hope you will give CommonCrawl a try and if you find it useful, you can donate to the project.

Friday 28 November 2014

Generating a test corpus for Apache Tika from CommonCrawl: Behemoth to the rescue!

It's been a while since I last blogged, in particular about Behemoth. For those who don't know about it, Behemoth is an open source project under the Apache license which helps with large scale processing of documents by providing wrappers for various libraries (Tika, UIMA, GATE etc.), a common document representation used by these wrappers and some utility classes for manipulating the datasets it generates. Behemoth runs on Hadoop and has been used in various projects over the years. I have started working on an equivalent for Apache Spark called Azazello (to continue with the same literary reference) but it is still early days.

I have been using Behemoth over the last couple of days to help with TIKA-1302. What we are trying to do there is build as large a test dataset as possible for Tika. We thought it would be interesting to use data from Common Crawl, in order to get (1) loads of it, (2) things seen in the wild and (3) various formats.

Behemoth steps

Luckily, Behemoth can process WARC files such as the ones generated by Common Crawl with its IO module. Assuming you have cloned the source code of Behemoth, compiled it with Maven and have Hadoop installed, all you need to do is call:

hadoop jar io/target/behemoth-io-*-SNAPSHOT-job.jar com.digitalpebble.behemoth.io.warc.WARCConverterJob -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_KEY -D document.filter.mimetype.keep=.+[^html]  s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00099-ip-10-60-113-184.ec2.internal.warc.gz behemothCorpus

Please note the document.filter.mimetype.keep=.+[^html] parameter: it allows us to filter the input documents and keep only the ones that do not have html in their mime type (as returned by the web servers).
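As an aside, and assuming the filter requires the whole mime type to match the expression, .+[^html] really means "anything whose last character is not h, t, m or l", which does exclude text/html but lets uppercase variants through (as we will see further down). A quick way to sanity-check a pattern before launching a job is shown below; the stricter case-insensitive alternative is just a suggestion, not something shipped with Behemoth.

import java.util.regex.Pattern;

public class MimeFilterCheck {

    public static void main(String[] args) {
        // Pattern used in the job above, and a stricter case-insensitive alternative
        // (the alternative is only a suggestion, not what Behemoth ships with).
        Pattern original = Pattern.compile(".+[^html]");
        Pattern stricter = Pattern.compile("(?i)^(?!text/html).+");

        String[] samples = { "text/html", "text/HTML", "application/pdf",
                             "image/jpeg", "text/plain" };
        for (String mime : samples) {
            System.out.printf("%-16s original=%-5b stricter=%b%n",
                    mime, original.matcher(mime).matches(),
                    stricter.matcher(mime).matches());
        }
    }
}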

The WARCConverterJob command above will generate a Hadoop sequence file containing serialized BehemothDocuments. The reader command can be used to have a peek at the content of the corpus, e.g.

./behemoth reader -i behemothCorpus -m | more

url: http://0.static.wix.com/dicons/7ffb03_2f63cf23ec1107e4ed9824f6c98e5847.wix_doc_ico
contentType: image/jpeg
metadata: 
Date: Tue, 21 Oct 2014 08:46:31 GMT
ETag: "6afff623058a88cb23a5b18c934ee8fd19192"
Server: s23.tam
Content-Type: image/jpeg
Connection: close
Content-Length: 19192
Cache-Control: max-age=604800
X-Seen-By: s23.tam_pp
IP: 207.36.47.4

[...]

The option -m displays the metadata; we could also display the binary content if we wanted to. See the wiki for the available options.

The next step is to generate an archive with the content of each file, for which we have the generic exporter command:

./behemoth exporter -i $segName -n URL -o file:///mnt/$segName -b

This gives us a number of archives, with the content of each document put in a separate file named after its URL. We can then push the resulting archives to the machine used for testing Tika.

Scaling with Amazon EMR

The commands above will work fine even on a laptop, but since we are interested in processing a substantial amount of data, we need a real Hadoop cluster.

I started a smallish 5-node Hadoop cluster with EMR, SSHed to the master, git-cloned Behemoth, compiled it and pushed the segment URLs from the latest release of Common Crawl into an SQS queue. I then wrote a small script which pulls the segment URLs from the queue one by one, calls the WARCConverterJob, then the exporter, before pushing the archives to the machine used for testing Tika. The latter step is a bit of a bottleneck, as it writes to the local filesystem on the master node.

On a typical segment (like s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/segments/1413507444312.8/) we filtered out 30,369,012 documents and kept 431,546. The top mime types look like this:


166208 contentType: image/jpeg
  63097 contentType: application/pdf
  58531 contentType: text/plain
  38497 contentType: image/png
  28906 contentType: text/calendar
  10162 contentType: image/gif
   7005 contentType: audio/x-wav
   6604 contentType: application/json
   3136 contentType: text/HTML
   2932 contentType: unknown/unknown
   2799 contentType: video/x-ms-asf
   2609 contentType: image/jpg
   1868 contentType: application/zip
   1798 contentType: application/msword

The regular expression we used to filter out the HTML documents did not take the uppercase variants into account; never mind, it still removed most of them.

What next?

One alternative to pushing the archives to an external server would be to run the tests with Behemoth, since it has an existing wrapper for Tika. This would make the tests completely scalable and we'd also be able to use the extra information available in the BehemothDocuments such as the mime-type returned by the servers.

We'll see how this dataset gets used in TIKA-1302. There are many ways in which Behemoth can be used and it has quite a few modules available. The aim of this blog post was to show how easy it is to process data on a large scale with it, with or without the CommonCrawl dataset.

By the way, CommonCrawl is a great resource; please support it by donating if you can (http://commoncrawl.org/donate/).




Wednesday 5 September 2012

Using Behemoth on the CommonCrawl dataset

Behemoth is an open-source platform for document processing based on Hadoop which provides an excellent way to process document collections on a large scale, such as crawled pages obtained with Nutch or CommonCrawl. 

Today, we are going to use a segment of the CommonCrawl dataset and show how to import data in Behemoth, filter on some common attributes and generate vectors for clustering with Apache Mahout.

CommonCrawl 

The CommonCrawl dataset  (http://commoncrawl.org/) is an open repository of web crawl data comprising 3.8 billion documents that are universally accessible. The data is available in different formats, the most recent one separating the raw content (ARC) from the metadata in JSON and the text (HTML only).

The ARC and text formats can be handled by the CommonCrawl module in Behemoth. 
This module converts CommonCrawl data to SequenceFiles of BehemothDocuments. The difference between the documents obtained from one format or the other lies in what is added to the BehemothDocs: binary content for the ARC format and plain text for the text format.

In order to access this source, you will need an AWS (Amazon Web Services) account, since the data is accessed with AWS credentials and transferring it out of AWS is not free.

What you need to set up for this step:
https://github.com/DigitalPebble/behemoth
https://github.com/DigitalPebble/behemoth-commoncrawl



Getting the data

Once Behemoth and its module for CommonCrawl have been installed, we can go to the command line and “cd” into the behemoth-commoncrawl folder.

We get the data from CommonCrawl and convert it into a Behemoth corpus: 

hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.commoncrawl.CommonCrawlConverterJob2012 -D fs.s3n.awsAccessKeyId={YOUR_OWN_ID} -D fs.s3n.awsSecretAccessKey={YOUR_OWN_KEY} -D document.filter.mimetype.keep=application/pdf s3n://aws-publicdatasets/common-crawl/parse-output/segment/1350433107106/* test-crawlpdf

  
In this example, we filter on the mime type, since we only want to import pdf documents.
By setting the filter: -D document.filter.mimetype.keep=application/pdf, we limit what is imported from CommonCrawl. The filter takes a regular expression and will import only those documents whose mime type matches the regular expression. Note that it is possible to filter based on other things such as the URL, the length of the document or any other metadata.


In order to inspect what has just been imported, we can now call the CorpusReader and look at the content of the Behemoth sequence file.



hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i test-crawlpdf -c -t


The corpus excerpt below gives you some information on the source of the document and its content type, and shows the first lines of the binary content (parameter -c). Note the parameter -t, which displays the text of the document; however, since the documents were generated from the ARC, the text has not been extracted yet.

Output after getting the data:

url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Content:
%PDF-1.6
%����
101 0 obj <</Linearized 1/L 527200/O 104/E 88751/N 12/T 525137/H [ 736 441]>>
endobj
            
xref
101 22
0000000016 00000 n
0000001177 00000 n
0000001293 00000 n
0000001418 0


Text Extraction

To obtain the text, we use the Tika module in Behemoth, which extracts the text from the documents in a Behemoth sequence file.

hadoop jar ./behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i test-crawlpdf/* -o crawlpdf-Tika
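For the curious, the extraction performed on each document is roughly what the plain Tika facade does in the snippet below; this is only meant to illustrate the step, not the actual Behemoth wrapper code, and the sample bytes are made up.

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

public class TikaTextExtraction {

    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();

        // In Behemoth the bytes come from the BehemothDocument; here we fake a tiny input
        byte[] content = "<html><body>Spa menu: massages available daily.</body></html>"
                .getBytes("UTF-8");

        Metadata metadata = new Metadata();
        try (InputStream stream = new ByteArrayInputStream(content)) {
            String text = tika.parseToString(stream, metadata);
            System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
            System.out.println("Text: " + text.trim());
        }
    }
}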

Now we inspect the corpus again and see the extracted text content (omitting the parameter -c).

Output after extracting the text content:

url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Text:

777 Harrah’s Rincon Way  
Valley Center, CA 92082  

760-751-7709
www.harrahsrincon.com

Prices, hours of operation and treatments are subject to change.
Must be 21 or older to gamble. 



While extracting the text content, Tika also generates annotations representing the original markup of a document (if present) and its metadata, which can be displayed with the parameters -m and -a.

Filtering on Language

Since, for the sake of argument, we are only interested in the English documents in the corpus, we need to filter out all those which are in a different language. The language identification module uses the LangDetect library (http://code.google.com/p/language-detection/) to identify and add a language ID to each document.
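For reference, this is roughly what the LangDetect library does for a single document; the snippet is illustrative only (the "profiles" directory path refers to the profiles shipped with the library, and the Behemoth module wires all of this into Hadoop for you).

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;

public class LanguageIdExample {

    public static void main(String[] args) throws Exception {
        // Load the language profiles once per JVM; the directory path is illustrative
        DetectorFactory.loadProfile("profiles");

        Detector detector = DetectorFactory.create();
        detector.append("Prices, hours of operation and treatments are subject to change.");
        System.out.println(detector.detect()); // expected to print "en"
    }
}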

We identify the language with:

(1) hadoop jar ./behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -i crawlpdf-Tika -o crawlpdf-Tika-lang
 

From the command line output or the Hadoop jobtracker counters, one can see the distribution of languages in the corpus.




Then we can filter on the language ID, in this case 'en' for English.

Having identified the languages, the filtering can be done either with the CorpusFilter from the core module:

hadoop jar behemoth-core*-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusFilter -D document.filter.md.keep.lang=en -i crawlpdf-Tika-lang -o crawlpdf-Tika-EN

Note that we could have done the same as part of the language identification step with:

(2) hadoop jar behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -D document.filter.md.keep.lang=en -i crawlpdf-Tika -o crawlpdf-Tika-EN

If you are only interested in filtering, the first step shown here is optional: the identification and filtering can be done in one go as shown in (2).

 

Clustering


Having filtered out all unwanted documents, we create the vectors representing the Behemoth documents, thanks to the resources in the Mahout module:


hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth -i crawlpdf-Tika-EN -o crawl-pdf-vec --namedVector
 

Having successfully finished the preprocessing and vector generation with Behemoth, we now switch to Mahout to do the clustering.

Using kmeans clustering in Mahout, there are two ways of generating the initial clusters:


1) One can specify the desired number of output clusters, and the initial centroids are generated as a first step of kmeans. This is probably best if you do not know your data very well but do know how many clusters you want in the end.

2) Another option is to use canopy clustering, where you define a minimum distance between centroids; the number of clusters then depends on that distance and, of course, on the distance measure used. It helps to estimate the average distance between vectors in your corpus beforehand: using appropriate values for the distance thresholds will give a more representative clustering result.

From your Mahout folder, first create the initial centroids with canopy clustering:

mahout canopy -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/canopy-centroids -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.1 -t2 0.5 -cl

Then you call kmeans, specifying the newly generated canopy centroids in the -c argument. The distance measure used here is Tanimoto, which takes the document length into account.

mahout kmeans -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/clusters -c crawl-pdf-vec/canopy-centroids/clusters-0-final  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -x 10 -cd 0.1 -cl


Since we’re interested in seeing which documents are allocated to which cluster, the ClusterDocIDDumper in Behemoth's Mahout module comes in handy:

hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.util.ClusterDocIDDumper -i crawl-pdf-vec/clusters/clusteredPoints  -o crawl-pdf-vec/clusterID

To extract the results to the local file system:

hadoop fs -text crawl-pdf-vec/clusterID > crawlpdf-clusterID


et voila:

….
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1505OK.pdf    6
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1511OK.pdf    37
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/OKVocablist.pdf    19
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1501OR.pdf    23
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1502OR.pdf    42
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1503OR.pdf    43
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1511OR.pdf    44
http://hdmaster.com/testing/cnatesting/tennessee/tnformpages/tnforms/1402TN.pdf    10
....


Conclusion


This was merely an exercise meant to illustrate some of the capabilities of Behemoth and how it can be used to process the CommonCrawl dataset. There are more modules available, such as the GATE or UIMA ones that we could have used to extract named entities, or the SOLR module to index the documents.


We actually used the CommonCrawl dataset with Behemoth for one of our clients in order to identify CVs automatically, using our text classification module alongside the Tika, GATE and Language ID modules. This was a great way of checking some of our assumptions before applying the same processes to the output of a Nutch crawl. CommonCrawl is a great resource, and if you need to do some text processing on its content, it is very likely that Behemoth and at least one of its existing modules will be useful.