DigitalPebble's Blog
Showing posts with label behemoth.
Wednesday, 29 March 2017

Need billions of web pages? Don't bother crawling...

How big did you say? I am often contacted by prospective clients to help them crawl the web on a very large scale or find questions such...
Friday, 28 November 2014

Generating a test corpus for Apache Tika from CommonCrawl : Behemoth to the rescue!

It's been a while since I last blogged, in particular about Behemoth. For those who don't know about it, Behemoth is an open sou...
Wednesday, 5 September 2012

Using Behemoth on the CommonCrawl dataset

Behemoth is an open-source platform for document processing based on Hadoop which provides an excellent way to process document collection...