Showing posts with label tika. Show all posts

Wednesday 5 September 2012

Using Behemoth on the CommonCrawl dataset

Behemoth is an open-source platform for document processing based on Hadoop which provides an excellent way to process document collections on a large scale, such as crawled pages obtained with Nutch or CommonCrawl. 

Today, we are going to use a segment of the CommonCrawl dataset and show how to import data into Behemoth, filter on some common attributes and generate vectors for clustering with Apache Mahout.

CommonCrawl 

The CommonCrawl dataset (http://commoncrawl.org/) is an open repository of web crawl data comprising 3.8 billion documents that are universally accessible. The data is available in different formats; the most recent one separates the raw content (ARC) from the metadata (in JSON) and the text (HTML only).

The ARC and text formats can be handled by the CommonCrawl module in Behemoth. 
This module converts CommonCrawl data to SequenceFiles of BehemothDocuments. The two formats differ in what is added to the BehemothDocuments: binary content for the ARC format, and text for the text format.

To access this source, you will need an AWS (Amazon Web Services) account, because using this data is not free.

What you need to set up for this step:
https://github.com/DigitalPebble/behemoth
https://github.com/DigitalPebble/behemoth-commoncrawl



Getting the data

Once Behemoth and its module for CommonCrawl have been installed, we can go to the command line and “cd” into the behemoth-commoncrawl folder.

We get the data from CommonCrawl and convert it into a Behemoth corpus: 

hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.commoncrawl.CommonCrawlConverterJob2012 -D fs.s3n.awsAccessKeyId={YOUR_OWN_ID} -D fs.s3n.awsSecretAccessKey={YOUR_OWN_KEY} -D document.filter.mimetype.keep=application/pdf s3n://aws-publicdatasets/common-crawl/parse-output/segment/1350433107106/* test-crawlpdf

  
In this example, we filter on the mime type, since we only want to import PDF documents.
By setting the filter -D document.filter.mimetype.keep=application/pdf, we limit what is imported from CommonCrawl. The filter takes a regular expression and imports only those documents whose mime type matches it. Note that it is also possible to filter on other attributes, such as the URL, the length of the document or any other metadata.
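
To make the behaviour of such a filter concrete, here is a minimal sketch in Python. It is not Behemoth code: the keep_by_mimetype function and the dictionary layout are hypothetical, and matching the whole mime type string (rather than a part of it) is an assumption.

```python
import re

def keep_by_mimetype(docs, pattern):
    """Return the documents whose content type fully matches the regex.

    Assumption: the whole mime type must match, not just a substring.
    """
    regex = re.compile(pattern)
    return [d for d in docs if regex.fullmatch(d.get("contentType", ""))]

# Hypothetical documents, reduced to the two fields the filter needs
docs = [
    {"url": "http://example.com/a.pdf", "contentType": "application/pdf"},
    {"url": "http://example.com/b.html", "contentType": "text/html"},
    {"url": "http://example.com/c.doc", "contentType": "application/msword"},
]

kept = keep_by_mimetype(docs, r"application/pdf")
print([d["url"] for d in kept])
```

A broader expression such as application/.* would keep both the PDF and the Word document in this toy corpus.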


In order to inspect what has just been imported, we can now call the CorpusReader and look at the content of the Behemoth sequence file.



hadoop jar ./target/behemoth-commoncrawl-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i test-crawlpdf -c -t


The document corpus excerpt gives you some information on the source of the document and its content type, and shows the first lines of the binary content (parameter -c). The parameter -t displays the text of the document; however, since the documents were generated from the ARC format, the text has not been extracted yet.

Output after getting the data:

url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Content:
%PDF-1.6
%����
101 0 obj <</Linearized 1/L 527200/O 104/E 88751/N 12/T 525137/H [ 736 441]>>
endobj
            
xref
101 22
0000000016 00000 n
0000001177 00000 n
0000001293 00000 n
0000001418 0


Text Extraction

To then obtain the text, we use the Tika module in Behemoth which extracts the text from the documents in a Behemoth sequence file.

hadoop jar ./behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i test-crawlpdf/* -o crawlpdf-Tika

Now we inspect the corpus again with the CorpusReader, this time omitting the parameter -c, and see the extracted text content.

Output after extracting the text content:

url: http://www.harrahsrincon.com/images/non_image_assets/RIN_New_spa_menu_web.pdf
contentType: application/pdf
Text:

777 Harrah’s Rincon Way  
Valley Center, CA 92082  

760-751-7709
www.harrahsrincon.com

Prices, hours of operation and treatments are subject to change.
Must be 21 or older to gamble. 



While Tika extracts the text content, it also generates annotations representing the original markup of a document (if present) and its metadata, which can be displayed with the parameters -m and -a.

Filtering on Language

Since, for the sake of argument, we are only interested in the English documents in the corpus, we need to filter out all those which are in a different language. The language identification module uses the LangDetect library (http://code.google.com/p/language-detection/) to identify the language of each document and add a language ID to it.

We identify the language with:

(1) hadoop jar ./behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -i crawlpdf-Tika -o crawlpdf-Tika-lang
 

From the command line output or the Hadoop jobtracker, one can see the distribution of languages in the corpus.




Then we can filter on the language ID, in this case 'en' - for  English.

After having identified the languages, the filtering can be done by using the CorpusFilter from the core module:

hadoop jar behemoth-core*-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusFilter -D document.filter.md.keep.lang=en -i crawlpdf-Tika-lang -o crawlpdf-Tika-EN

Note that we could have done the same as part of the language identification step with:

(2) hadoop jar behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -D document.filter.md.keep.lang=en -i crawlpdf-Tika -o crawlpdf-Tika-EN

If you are only interested in filtering, the separate step (1) is optional: identification and filtering can be done in a single step, as shown in (2). The corresponding output can be checked in the jobtracker.
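
Leaving the jobtracker aside, the logic of these two steps (counting the languages, then keeping one of them) can be sketched in a few lines of Python. This only illustrates the idea and is not how Behemoth implements it; the function names are hypothetical, and the metadata key lang is assumed from the document.filter.md.keep.lang option used above.

```python
from collections import Counter

def language_distribution(docs):
    """Count how many documents carry each language ID in their metadata."""
    return Counter(d["metadata"].get("lang", "unknown") for d in docs)

def keep_language(docs, lang):
    """Keep only the documents whose 'lang' metadata equals the given ID."""
    return [d for d in docs if d["metadata"].get("lang") == lang]

# Hypothetical documents that already went through language identification
docs = [
    {"url": "http://example.com/1.pdf", "metadata": {"lang": "en"}},
    {"url": "http://example.com/2.pdf", "metadata": {"lang": "de"}},
    {"url": "http://example.com/3.pdf", "metadata": {"lang": "en"}},
]

print(language_distribution(docs))
print([d["url"] for d in keep_language(docs, "en")])
```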

 

Clustering


Having filtered out all unwanted documents, we create the vectors representing the Behemoth documents, thanks to the resources in the Mahout module:


hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth -i crawlpdf-Tika-EN -o crawl-pdf-vec --namedVector
 

Having successfully finished the preprocessing and vector generation with Behemoth, we now change to Mahout (available here) to do the clustering.

Using kmeans clustering in Mahout, there are two ways of generating the initial clusters:


1) One can specify the desired number of output clusters, and the initial centroids are generated as a first step in kmeans. This is probably the best option if you do not know your data very well but do know how many clusters you want to have in the end.

2) Another option is to use canopy clustering, where you define a minimal distance between the centroids and the number of clusters depends on that distance and obviously also on the distance measure used.
There are ways to calculate the average distance between vectors in your corpus beforehand. Using more appropriate values for the distance will probably give a more representative clustering result.
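
One simple way to get a feel for such distances is to average the pairwise distance over a sample of vectors. The sketch below does this in plain Python with the Tanimoto distance, 1 - a.b / (|a|^2 + |b|^2 - a.b). It does not use Mahout; the function names and the toy vectors are made up for the illustration, and vectors are assumed to be dictionaries mapping a term to its weight.

```python
from itertools import combinations

def tanimoto_distance(a, b):
    """Tanimoto distance between two sparse vectors given as {term: weight}."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sum(w * w for w in a.values())
    norm_b = sum(w * w for w in b.values())
    denom = norm_a + norm_b - dot
    # denom is zero only when both vectors are empty; treat them as identical
    return 1.0 - dot / denom if denom else 0.0

def average_pairwise_distance(vectors):
    """Mean distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

# Toy vectors standing in for the tf-idf vectors of three documents
a = {"spa": 1.0, "menu": 1.0}
b = {"spa": 1.0, "price": 1.0}
c = {"form": 1.0}

print(average_pairwise_distance([a, b, c]))
```

The number of pairs grows quadratically with the number of documents, so on a real corpus one would run this on a random sample rather than on all the vectors.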

From your Mahout folder, create the initial centroids with canopy clustering:

mahout canopy -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/canopy-centroids -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.1 -t2 0.5 -cl
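
For readers who have not met canopy clustering before, here is a small sketch of the textbook version of the algorithm in Python. It is not Mahout's implementation, which may differ in its details; the function name and the one-dimensional toy data are hypothetical. In this formulation t1 decides which points are attached to a canopy, and t2 is the minimal distance a point must keep from a center to be allowed to start a canopy of its own.

```python
def canopy_centers(points, distance, t1, t2):
    """Textbook canopy clustering: returns a list of (center, members)."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_candidates = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)            # attached to this canopy
            if d >= t2:
                still_candidates.append(p)   # far enough to seed another one
        remaining = still_candidates
        canopies.append((center, members))
    return canopies

# One-dimensional toy data with the absolute difference as distance
points = [0.0, 0.2, 1.0, 1.1, 5.0]
canopies = canopy_centers(points, lambda a, b: abs(a - b), t1=1.5, t2=0.5)
print([center for center, _ in canopies])
```

Raising t2 to 2.0 on the same data leaves only two centers, which shows how the number of clusters follows from the chosen distance.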

Then you call kmeans, specifying the newly generated canopy centroids in the -c argument. The distance measure used here is Tanimoto, which takes the document length into account.

mahout kmeans -i crawl-pdf-vec/tfidf-vectors -o crawl-pdf-vec/clusters -c crawl-pdf-vec/canopy-centroids/clusters-0-final  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -x 10 -cd 0.1 -cl


Since we’re interested in seeing which documents are allocated to which cluster, the ClusterDocIDDumper in the Mahout module of Behemoth comes in handy:

hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.util.ClusterDocIDDumper -i crawl-pdf-vec/clusters/clusteredPoints  -o crawl-pdf-vec/clusterID

To extract the results to the local file system:

hadoop fs -text crawl-pdf-vec/clusterID > crawlpdf-clusterID


Et voilà:

….
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1505OK.pdf    6
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1511OK.pdf    37
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/OKVocablist.pdf    19
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1501OR.pdf    23
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1502OR.pdf    42
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1503OR.pdf    43
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1511OR.pdf    44
http://hdmaster.com/testing/cnatesting/tennessee/tnformpages/tnforms/1402TN.pdf    10
....
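
Once the file is on the local file system, grouping the URLs by cluster takes only a few lines. The sketch below is in Python and assumes that every line has the shape shown in the excerpt above, i.e. a URL, some whitespace and a numeric cluster ID; the function name is hypothetical.

```python
from collections import defaultdict

def group_by_cluster(lines):
    """Group URLs by cluster ID from lines of the form '<url> <id>'."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        url, cluster_id = parts
        clusters[int(cluster_id)].append(url)
    return dict(clusters)

# Two lines taken from the excerpt above
sample = """\
http://hdmaster.com/testing/cnatesting/oklahoma/okformpages/okforms/1505OK.pdf    6
http://hdmaster.com/testing/cnatesting/oregon/orformpages/1501OR.pdf    23
"""

for cluster_id, urls in sorted(group_by_cluster(sample.splitlines()).items()):
    print(cluster_id, len(urls))
```

To process the real output, pass open("crawlpdf-clusterID") instead of the sample lines.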


Conclusion


This was merely an exercise meant to illustrate some of the capabilities of Behemoth and how it could be used to process the CommonCrawl dataset. There are more modules available, such as the GATE or UIMA ones that we could have used to extract named entities, or the SOLR module to index the documents.


We actually used the CommonCrawl dataset with Behemoth for one of our clients in order to identify CVs automatically, using our text classification module alongside the Tika, GATE and Language ID modules. This was a great way of checking some of our assumptions before applying the same processes to the output of a Nutch crawl. CommonCrawl is a great resource, and if you need to do some text processing on its content, it is very likely that Behemoth and at least one of its existing modules will be useful.















 

 

Friday 27 May 2011

Parsing the Enron email dataset using Tika and Hadoop

In order to parse a large collection of emails, such as the Enron Email Dataset, we might choose to use Apache Hadoop, a scalable computing framework, and Apache Tika, a content analysis toolkit. This can be done easily with Behemoth, an open source platform for large scale document analysis developed by DigitalPebble. For more details of Behemoth, see the Behemoth Tutorial.

Using the August 21, 2009 version of the dataset, the first step is to use Behemoth's CorpusGenerator to create a corpus of BehemothDocuments from the Enron Dataset in HDFS. A BehemothDocument is the native object used by Behemoth. At ingest, it contains the original document, its content type and URL. After processing by a Behemoth module, it also contains the extracted text, additional metadata and annotations created about the document.
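
To picture what such a document holds before and after processing, here is a rough Python mirror of the structure described above. It is only an illustration: the real BehemothDocument is a Java class, and the class names, field names and the shape of an annotation used here are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Annotation:
    """Hypothetical annotation: a typed span over the extracted text."""
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)

@dataclass
class Document:
    """Hypothetical stand-in for a BehemothDocument."""
    url: str
    content_type: str
    content: bytes                      # original document, set at ingest
    text: Optional[str] = None          # filled in by a processing module
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)

# At ingest: only the URL, the content type and the raw content are known
doc = Document(
    url="http://example.com/message-1",
    content_type="message/rfc822",
    content=b"Subject: hello\n\nHi there",
)

# After processing: text, metadata and annotations have been added
doc.text = "Hi there"
doc.metadata["subject"] = "hello"
doc.annotations.append(Annotation(type="paragraph", start=0, end=8))
print(doc.text, doc.metadata, len(doc.annotations))
```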

Once the dataset has been ingested, the next step is to use the Behemoth Tika module to create a Hadoop Map/Reduce job to extract the contents of the emails and metadata about them. Using Apache Tika 0.9, 5% of the documents fail to parse correctly. However, using the latest version of Tika (Tika-1.0-snapshot revision 825923), only 0.2% of the documents fail.

One way to investigate why parsing is failing is to look at the user logs generated within Hadoop, which contain details of the exceptions thrown for the failing documents. An alternative is to write a custom reducer that sorts the exceptions thrown by Tika, with the exception stack used as the key and the document URLs as the values. With Tika revision 825923, four exceptions are thrown, caused by two underlying problems: malformed dates, and excessive line lengths of over 10,000 characters, the current default limit in the Tika mail parser. The line length problem can be solved by increasing the maximum line length in a MimeEntityConfig object and then modifying TikaProcessor to pass it into the ParseContext.
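
The reducer idea mentioned above boils down to grouping the failing URLs by exception stack. Here is a minimal sketch of that grouping in plain Python rather than as a Hadoop reducer; the function name, the stack trace strings and the URLs are all made up for the illustration.

```python
from collections import defaultdict

def group_failures(failures):
    """Group document URLs by the stack trace of the exception they caused.

    `failures` is an iterable of (stack_trace, url) pairs, i.e. what a map
    step would emit. The result is sorted with the most frequent trace first.
    """
    grouped = defaultdict(list)
    for stack_trace, url in failures:
        grouped[stack_trace].append(url)
    return sorted(grouped.items(), key=lambda kv: len(kv[1]), reverse=True)

# Hypothetical failures: placeholder traces, not real Tika output
failures = [
    ("stack trace A", "file:///enron/message-1"),
    ("stack trace B", "file:///enron/message-2"),
    ("stack trace B", "file:///enron/message-3"),
]

for stack_trace, urls in group_failures(failures):
    print(len(urls), stack_trace)
```

Sorting by the number of URLs puts the most common cause of failure at the top, which is usually the one worth fixing first.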

As for the second problem, currently the mail parser in Tika performs strict parsing, i.e. parsing a document fails when parsing a field fails. Tika-667 contains a contribution that makes it possible to turn off strict parsing, so some data can still be extracted from the emails with the malformed dates. This can also be configured via MimeEntityConfig. When these changes are incorporated, all documents are processed correctly.

Saturday 19 March 2011

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise:
  • strong background in NLP and Java
  • GATE, experience of writing plugins and PRs, excellent knowledge of JAPE
  • IE, Linked Data, Ontologies
  • statistical approaches and machine learning
  • large scale computing with Hadoop
  • knowledge of the following technologies / tools: Lucene, SOLR, NoSQL, Tika, UIMA, Mahout
  • good social and presentation skills
  • good spoken and written English, knowledge of other languages would be a plus
  • taste for challenges and problem solving

    DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering.

    More details on our activities can be found on our website. We would consider candidates working remotely with occasional travel to Bristol and our clients in UK and Europe. Being located in or near Bristol would be a plus.

    This job is an opportunity to get involved in the growth of a small company, work on interesting projects and take part in various Apache related projects and events. Bristol is also a great place to live.


   Please send your CV and cover letter before the 15th April 2011 to job@digitalpebble.com


    Best regards,

    Julien Nioche

Saturday 28 August 2010

Behemoth talk from BerlinBuzzwords 2010

The talk I gave on Behemoth at BerlinBuzzwords has been filmed (I do not dare watch it) and is available at http://blip.tv/file/3809855.

The slides can be found at http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp

The talk contains a quick demo of GATE and mentions Tika, UIMA and of course Hadoop.

Thursday 19 August 2010

Tika on FeatherCast

Apache Tika recently split off from the Lucene project and became a separate top level Apache project. Chris Mattmann talks about what Tika is and where it’s going at http://feathercast.org/?p=90

Friday 13 August 2010

Towards Nutch 2.0

Never mind the dodgy look of the blog - I'll improve that later!

For my first post, I'd like to mention the progress we've made recently towards Apache Nutch 2.0. It is based on a branch named NutchBase, which has been developed mainly by Doğacan Güney and is now in the trunk of the SVN repository. One of the main aspects of Nutch 2.0 is that it now stores its data in a datastore and not in Hadoop's file-based structures. Note that we still have the distribution and replication of the data over a whole cluster and data locality for MapReduce, but we can also insert or modify a random entry in the table without having to read and write the whole data structure, as was the case before.

Nutch uses a project named GORA as an intermediate between our code and the backend storage. There would be a lot of things to say on GORA but to make it short what we are trying to achieve with it is to make it a sort of common API for NoSQL stores. GORA already has implementations for HBase and Cassandra but also SQL. The plan for GORA is to put it in the Apache Incubator or possibly as an Apache subproject (Hadoop? HBase? Cassandra?). We'll see how it goes.

There are quite a few structural changes in Nutch, most notably the fact that there aren't any segments any more, as all the information about a URL (metadata, original content, extracted text, ...) is stored in a single table. This means, for instance, no more segments to merge or metadata to move back to the crawldb. It's all in one place!

There are other substantial changes in 2.0, notably the removal of the Lucene-based indexing and search, as we now rely on SOLR. Other indexing backends might be added later. Another step towards delegating functionalities to external projects is the increased use of Apache Tika for the parsing. We've removed quite a few legacy parsers from Nutch and let Tika do the work for us. We've also revamped the organisation of the code and done a lot of code clean-up.

Nutch 2.0 is still at an early stage and we are actively working on it, testing, debugging etc... The good news is that it is not only an architectural change but also a basis for a whole lot of new functionalities (see for instance https://issues.apache.org/jira/browse/NUTCH-882).

I'll keep you posted on our progress. As usual: give it a try, get involved, join us...