Behemoth is an open-source, Hadoop-based platform for document processing. It is well suited to processing document collections on a large scale, such as pages crawled with Nutch or obtained from CommonCrawl.
Today, we are going to use a segment of the CommonCrawl dataset and show how to import data into Behemoth, filter on some common attributes and generate vectors for clustering with Apache Mahout.
CommonCrawl
The CommonCrawl dataset (http://commoncrawl.org/) is an open, universally accessible repository of web crawl data comprising 3.8 billion documents. The data is available in different formats; the most recent one separates the raw content (ARC) from the metadata in JSON and the text (HTML only).
The ARC and text formats can be handled by the CommonCrawl module in Behemoth.
This module converts CommonCrawl data to SequenceFiles of BehemothDocuments. The documents obtained from the two formats differ in what is stored in each BehemothDocument: binary content for the ARC format and text for the text format.
In order to access this source, you will need an AWS (Amazon Web Services) account, because access to this data is not free of charge.
What you need to set up for this step:
https://github.com/DigitalPebble/behemoth
https://github.com/DigitalPebble/behemoth-commoncrawl
Getting the data
Once Behemoth and its CommonCrawl module have been installed, we can go to the command line and “cd” into the behemoth-commoncrawl folder.
We get the data from CommonCrawl and convert it into a Behemoth corpus:
hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.commoncrawl.CommonCrawlConverterJob2012 -D fs.s3n.awsAccessKeyId={YOUR_OWN_ID} -D fs.s3n.awsSecretAccessKey={YOUR_OWN_KEY} -D document.filter.mimetype.keep=application/pdf s3n://aws-publicdatasets/common-crawl/parse-output/segment/1350433107106/* test-crawlpdf
In this example, we filter on the MIME type, since we only want to import PDF documents. By setting the filter -D document.filter.mimetype.keep=application/pdf, we limit what is imported from CommonCrawl. The filter takes a regular expression and imports only those documents whose MIME type matches it. Note that it is also possible to filter on other attributes such as the URL, the length of the document or any other metadata.
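To make the behaviour of such a filter concrete, here is a minimal Python sketch of a keep-filter driven by a regular expression. It is not Behemoth's code: the function name keep_by_mimetype, the dictionary-based documents and the choice of full-string matching are assumptions made for illustration.

```python
import re

def keep_by_mimetype(docs, pattern):
    """Return only the documents whose content type matches the regex.

    Assumption: the whole MIME type must match the pattern (full match);
    the real filter may apply the expression differently.
    """
    regex = re.compile(pattern)
    return [d for d in docs if regex.fullmatch(d.get("contentType", ""))]

# Hypothetical documents, for illustration only.
docs = [
    {"url": "http://example.com/a.pdf", "contentType": "application/pdf"},
    {"url": "http://example.com/b.html", "contentType": "text/html"},
    {"url": "http://example.com/c.doc", "contentType": "application/msword"},
]

pdf_only = keep_by_mimetype(docs, r"application/pdf")
office_like = keep_by_mimetype(docs, r"application/(pdf|msword)")
```

Because the filter value is a regular expression, a single setting can keep several MIME types at once, as the second call shows.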
To inspect what has just been imported, we can now call the CorpusReader and look at the content of the Behemoth sequence file:
hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i test-crawlpdf -c -t
The corpus excerpt gives some information on the source of each document and its content type, and shows the first lines of the binary content (parameter -c). The parameter -t displays the text of the document; however, since the documents were generated from the ARC format, the text has not been extracted yet.
Output after getting the data:
url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Content:
%PDF-1.6
%����
101 0 obj <</Linearized 1/L 527200/O 104/E 88751/N 12/T 525137/H [ 736 441]>>
endobj
xref
101 22
0000000016 00000 n
0000001177 00000 n
0000001293 00000 n
0000001418 0
Text Extraction
To obtain the text, we use the Tika module in Behemoth, which extracts the text from the documents in a Behemoth sequence file:
hadoop jar ./behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i test-crawlpdf/* -o crawlpdf-Tika
Now we inspect the corpus again, this time omitting the parameter -c, and see the extracted text content.
Output after extracting the text content:
url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Text:
777 Harrah’s Rincon Way
Valley Center, CA 92082
760-751-7709
www.harrahsrincon.com
Prices, hours of operation and treatments are subject to change.
Must be 21 or older to gamble.
While Tika extracts the text content, it also generates annotations representing the original markup of a document (if present) and its metadata, which can be displayed with the parameters -m and -a.
Filtering on Language
Since, for the sake of argument, we are only interested in the English documents in the corpus, we need to filter out all those which are in a different language. The language identification module uses the LangDetect library (http://code.google.com/p/language-detection/) to identify and add language IDs to each document.
We identify the language with:
(1) hadoop jar ./behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -i crawlpdf-Tika -o crawlpdf-Tika-lang
From the command line output or the Hadoop JobTracker, one can see the distribution of languages in the corpus.
We can then filter on the language ID, in this case 'en' for English.
Once the languages have been identified, the filtering can be done with the CorpusFilter from the core module:
hadoop jar behemoth-core*-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusFilter -D document.filter.md.keep.lang=en -i crawlpdf-Tika-lang -o crawlpdf-Tika-EN
Note that we could have done the same as part of the language identification step with:
(2) hadoop jar behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -D document.filter.md.keep.lang=en -i crawlpdf-Tika -o crawlpdf-Tika-EN
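Conceptually, this step amounts to counting documents per language ID and keeping only those with a given ID. The Python sketch below illustrates both operations. It is not the Behemoth implementation: the documents are plain dictionaries with a hypothetical lang metadata key, and the language IDs are assumed to have been assigned already.

```python
from collections import Counter

def language_distribution(docs):
    """Count documents per language ID (documents without one count as 'unknown')."""
    return Counter(d.get("metadata", {}).get("lang", "unknown") for d in docs)

def keep_language(docs, lang):
    """Keep only the documents whose 'lang' metadata equals the given ID."""
    return [d for d in docs if d.get("metadata", {}).get("lang") == lang]

# Hypothetical corpus, for illustration only.
corpus = [
    {"url": "http://example.com/1.pdf", "metadata": {"lang": "en"}},
    {"url": "http://example.com/2.pdf", "metadata": {"lang": "de"}},
    {"url": "http://example.com/3.pdf", "metadata": {"lang": "en"}},
    {"url": "http://example.com/4.pdf", "metadata": {}},
]

distribution = language_distribution(corpus)
english = keep_language(corpus, "en")
```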
If you are only interested in filtering, step (1) is optional: the identification and the filtering can be done in a single job, as shown in (2). The corresponding jobtracker output would look like this:
Clustering
Having filtered out all unwanted documents, we create the vectors representing the Behemoth documents with the Mahout module:
hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth -i crawlpdf-Tika-EN -o crawl-pdf-vec --namedVector
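The clustering commands further down read their input from the tfidf-vectors directory of this output. As a rough illustration of what a TF-IDF vector holds, the sketch below computes a textbook weighting in Python. The assumptions are whitespace tokenisation, raw term counts and log(N / df) as the inverse document frequency; the tokenisation and exact weighting used by Mahout may differ, and the function name is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Map each text to a sparse {term: weight} dictionary (textbook TF-IDF)."""
    tokenised = [t.lower().split() for t in texts]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

# Hypothetical texts, for illustration only.
texts = ["spa menu prices", "spa treatments", "exam form"]
vectors = tfidf_vectors(texts)
```

Terms that occur in many documents (here "spa") receive a lower weight than terms that are specific to a single document.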
Having finished the preprocessing and vector generation with Behemoth, we now switch to Mahout to do the clustering.
Using kmeans clustering in Mahout, there are two ways of generating the initial clusters:
1) One can specify the desired number of output clusters, and the initial centroids are generated as a first step in kmeans. This will probably be best if you do not know your data very well but do know how many clusters you want to have in the end.
2) Another option is to use canopy clustering, where you define a minimal distance between the centroids. The number of clusters then depends on that distance and also on the distance measure used.
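To illustrate how the number of canopies follows from the distance thresholds (the -t1 and -t2 options of the canopy command), here is a minimal in-memory sketch of canopy clustering in Python. It follows the textbook formulation with a loose threshold T1 and a tight threshold T2, uses the absolute difference between 1-D points for readability, and is not Mahout's implementation; all names and values are hypothetical.

```python
def canopy_centres(points, t1, t2, distance):
    """Return (centre, members) pairs found by textbook canopy clustering.

    t1 is the loose threshold (canopy membership), t2 the tight one
    (points this close to a centre cannot start a new canopy).
    The textbook formulation expects t1 > t2.
    """
    remaining = list(points)
    canopies = []
    while remaining:
        centre = remaining.pop(0)  # pick the next unassigned point
        members = [centre] + [p for p in remaining if distance(centre, p) < t1]
        # Points tightly bound to this centre are removed from the pool.
        remaining = [p for p in remaining if distance(centre, p) >= t2]
        canopies.append((centre, members))
    return canopies

def absolute_difference(a, b):
    return abs(a - b)

# Hypothetical 1-D data with two obvious groups.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
wide = canopy_centres(points, t1=3.0, t2=1.5, distance=absolute_difference)
narrow = canopy_centres(points, t1=0.3, t2=0.15, distance=absolute_difference)
```

With the wide thresholds the six points collapse into two canopies, whereas the narrow thresholds leave almost every point as its own canopy, which is why the choice of thresholds matters so much for the clusters obtained.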
There are ways to calculate the average distance between vectors in your corpus beforehand. Using more appropriate values for the distance will probably give a more representative clustering result.
Here we create the initial centroids with canopy clustering, from your mahout folder:
mahout canopy -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/canopy-centroids -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.1 -t2 0.5 -cl
Then you call kmeans, specifying the newly generated canopy centroids in the -c argument. The distance measure used here is Tanimoto, which takes the document length into account.
mahout kmeans -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/clusters -c crawl-pdf-vec/canopy-centroids/clusters-0-final -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -x 10 -cd 0.1 -cl
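For reference, the Tanimoto distance between two vectors a and b is 1 - (a·b) / (|a|² + |b|² - a·b). The Python sketch below implements this formula for sparse vectors stored as dictionaries and uses it to compute the average pairwise distance over a small sample, which is one simple way to get a feel for suitable distance values. The function names and the sample vectors are hypothetical, and this is not Mahout's code.

```python
from itertools import combinations

def tanimoto_distance(a, b):
    """Tanimoto distance between two sparse vectors given as {term: weight}."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = sum(w * w for w in a.values())
    norm_b = sum(w * w for w in b.values())
    denominator = norm_a + norm_b - dot
    if denominator == 0:
        return 0.0  # both vectors are empty, treat them as identical
    return 1.0 - dot / denominator

def average_pairwise_distance(vectors, distance=tanimoto_distance):
    """Mean distance over all unordered pairs (needs at least two vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical sample vectors, for illustration only.
sample = [
    {"spa": 1.0, "menu": 1.0},
    {"spa": 1.0, "menu": 1.0},
    {"exam": 1.0, "form": 1.0},
]
mean_distance = average_pairwise_distance(sample)
```

On a real corpus one would compute this over a random sample of the vectors rather than over all pairs, since the number of pairs grows quadratically with the number of documents.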
Since we’re interested in seeing what documents are allocated to which cluster, the ClusterDocIDDumper in the Mahout module in Behemoth comes in useful:
hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.util.ClusterDocIDDumper -i crawl-pdf-vec/clusters/clusteredPoints -o crawl-pdf-vec/clusterID
To extract the results to the local file system:
hadoop fs -text crawl-pdf-vec/clusterID > crawlpdf-clusterID
Et voilà:
….
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1505OK.pdf 6
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1511OK.pdf 37
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/OKVocablist.pdf 19
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1501OR.pdf 23
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1502OR.pdf 42
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1503OR.pdf 43
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1511OR.pdf 44
http://hdmaster.com/testing/cnatesting/tennessee/tnformpages/tnforms/1402TN.pdf 10
....
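A file in this format is easy to post-process. As an example, the following Python sketch groups URLs by cluster ID. It assumes that each line holds a URL and an integer cluster ID separated by whitespace, as in the excerpt above, and skips anything else; the function name is hypothetical.

```python
from collections import defaultdict

def group_by_cluster(lines):
    """Group URLs by cluster ID from 'URL<whitespace>clusterID' lines."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.rsplit(None, 1)  # split on the last run of whitespace
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        url, cluster_id = parts
        clusters[int(cluster_id)].append(url)
    return dict(clusters)

# Lines taken from the excerpt above; for a real run, read them from the
# local file instead, e.g. open("crawlpdf-clusterID").
excerpt = [
    "http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1505OK.pdf 6",
    "http://hdmaster.com/testing/cnatesting/oregon/orformpages/1501OR.pdf 23",
    "....",
]
clusters = group_by_cluster(excerpt)
```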
Conclusion
This was merely an exercise meant to illustrate some of the capabilities of Behemoth and how it could be used to process the CommonCrawl dataset. There are more modules available, such as the GATE or UIMA ones that we could have used to extract named entities, or the SOLR module to index the documents.
We actually used the CommonCrawl dataset with Behemoth for one of our clients in order to identify CVs automatically, using our text classification module alongside the Tika, GATE and Language ID modules. This was a great way of checking some of our assumptions before applying the same processes to the output of a Nutch crawl. CommonCrawl is a great resource, and if you need to do some text processing on its content, it is very likely that Behemoth and at least one of its existing modules will be useful.