DigitalPebble's Blog: tika

Showing posts with label tika. Show all posts

Thursday, 16 January 2020

What's new in StormCrawler 1.16?

Happy new year!

StormCrawler 1.16 was released a couple of days ago. You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/26?closed=1

As usual, we recommend that all users upgrade to this version as it contains important fixes and performance improvements.

Dependency upgrades

Tika 1.23 (#771)
ES 7.5.0 (#770)
jackson-databind from 2.9.9.2 to 2.9.10.1 dependency (#767)

Core

OKHttp configure authentication for proxies (#751)
Make URLBuffer configurable + AbstractURLBuffer uses URLPartitioner (#754)
/bugfix/ okhttp protocol: reliably mark trimmed content because of content limit (#757)
/!breaking!/ urlbuffer code in a separate package + 2 new implementations (#764)
Crawl-delay handling: allow `fetcher.max.crawl.delay` exceed 300 sec.(#768)
okhttp protocol: HTTP request header lacks protocol name and version (#775)
Locking mechanism for Metadata objects (#781)

LangID

/bugfix/ langID parse filter gets stuck (#758)

Elasticsearch

/bugfix/ Fix NullPointerException in JSONResourceWrappers (#760)
ES specify field used for grouping the URLs explicitly in mapping (#761)
Use search after for pagination in HybridSpout (#762)
Filter queries in ES can be defined as lists (#765)
es.status.bucket.sort.field can take a list of values (#766)
Archetype for SC+Elasticsearch (#773)
ES merge seed injection into crawl topology (#778)
Kibana - change format of templates to ndjson (#780)
/bugfix/ HybridSpout get key for results when prefixed by "metadata." (#782)
AggregationSpout to store sortValues for the last result of each bucket (#783)
Import Kibana dashboards using the API (#785)
Include Kibana script and resources in ES archetype (#786)

One of the main improvements in 1.16 is the addition of a Maven archetype to generate a crawl topology using Elasticsearch as a backend (#773). This is done by calling

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST

The generated project also contains a script and resources to load templates into Kibana.

The topology for Elasticsearch now includes the injection of seeds from a file, which was previously in a separate topology. These changes should help beginners get started with StormCrawler.

The previous release included URLBuffers, with just one simple implementation. Two new implementations have been added in #764. The brand new PriorityURLBuffer sorts the buckets by the number of acks they got since the last sort whereas the SchedulingURLBuffer tries to guess when a queue should release a URL based on how long it took its previous URLs to be acked on average. The former has been used extensively with the HybridSpout but the latter is still experimental.

Finally, we added a soft locking mechanism to Metadata (#781) to help trace the source of ConcurrentModificationExceptions. If you are experiencing such exceptions, calling metadata.lock() when emitting e.g.

collector.emit(StatusStreamName, tuple, new Values(url, metadata.lock(), Status.FETCHED))

will trigger an exception whenever the metadata object is modified somewhere else. You might need to call unlock() in the subsequent bolts.

This does not change the way the Metadata works but is just there to help you debug.

Hopefully, we should be able to release 2.0 in the next few months. In the meantime, happy crawling and a massive thank you to all contributors!

Monday, 13 May 2019

What's new in StormCrawler 1.14

StormCrawler 1.14 was released yesterday and as usual, contains loads of improvements and bugfixes.

You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/24?closed=1

This release contains a number of breaking changes, mostly related to the move to Elasticsearch 7. We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

crawler-commons 1.0 #693
okhttp 3.14.0 #692
guava 27.1 (#702)
icu4j 64.1 #702)
httpclient 4.5.8 #702)
Snakeyaml 1.24 #702)
wiremock 2.22.0 #702)
rometools 1.12.0 #702)
Elasticsearch 7.0.0 (#708)

Core

Track how long a spout has been without any URLs in its buffer (#685)
Change ack mechanism for StatusUpdaterBolts (#689)
Robots URL filter to get instructions from cache only (#700)
Allow indexing under canonical URL if in the same domain, not just host (#703)
/bugfix/ URLs ending with a space are fetched over and over again (#704)
ParseFilter to normalise the mime-type of documents into simple values (#707)
Robot rules should check the cache in case of a redirection (#709)
/bugfix/ Fix the logic around sitemap = false (#710)
Reduce logging of exceptions in FetcherBolt (#719)

Elasticsearch

Asynchronous spouts (i.e ES) can send queries after max delay since previous one ended (#683)
StatusUpdaterBolt to load config from non-default param names (#687)
Add a ScrollSpout to read all the documents from a shard (#688 and #690) - see in our guest post how this can be used to reindex a status index.
ES IndexerBolt : check success of batches before acking tuples (#647)
/bugfix/ URLs with content that breaks ES get refetched over and over again (#705)
/bugfix/ URLs without valid host name (and routing) stay DISCOVERED forever (#706)
/bugfix/ ESSeedInjector: no URLs injected because URL filter does not subscribe to status stream (#715)
MetricsConsumer to include topology ID in metrics(#714)

WARC

Generate WARC request records (#509)
WARC format improvements (#691)

Tika

Set mimetype whitelist for Tika Parser (#712)

*********

I will be running a workshop on StormCrawler next month at the Web Archiving Conference in Zagreb and give a presentation jointly with Sebastian Nagel of CommonCrawl. I will come with loads of presents generously given by our friends at Elastic.

As usual, thanks to all contributors and users.

Happy crawling!

Friday, 25 May 2018

What's new in StormCrawler 1.9

Dependency upgrades

OKHttp 3.10.0 #546
JSoup 1.11.2 #552
icu4j 61.1 #556
Rometools 1.9.0 #556
HTTPClient 4.5.5 #558
Tika 1.18 #566

Core

Crawl-delay in robots.txt should optionally not shrink the configured delay #549
Optimisation: faster extraction of META tags #553
CollectionMetric synchronized access to List #555
Configurable Robots Caches #557
JSOUPParserBolt: lazy DOM conversion #563
Purge internal queues of tuples which have already reached timeout #564
Added ParseFilter to convert single valued Metadata to multi-valued ones #571
Caching of redirected robots.txt may overwrite correct robots.txt rules, fixes #573

WARC

WARCBolt to handle incorrect URIs gracefully #560
WARCRecordFormat use ByteBuffer instead of ByteArrayOutputStream #561

Archetype

Uses flux-core 1.2.1 #559
Added FeedParser to archetype topology #551
Added .kml and .wmv to url filters

SOLR

MetricsConsumer handles recursive values #554

Elasticsearch

MetricsConsumer handles recursive values #554
ES Indexer and Deletion Bolts to get index name from constructor #572

LanguageID

Added option to LanguageID to skip if metadata already set #570

As usual, we advise all users to move to this version as it fixes several bugs. Thanks to all contributors and users. Happy crawling!

Tuesday, 20 March 2018

What's new in StormCrawler 1.8

I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:

Dependency updates

Storm 1.2.1 #531
SOLR 7.2.1 #528
Tika 1.17 #518
Elasticsearch 6.2.2 #525 and #539

Core

Add option to send only N bytes of text to indexers #476
BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522
MemorySpout to generate tuples with DISCOVERED status #529
OKHttp configure type of proxy #530
http.content.limit inconsistent default to -1 #534
Track time spent in the FetcherBolt queues #535
Increase detect.charset.maxlength default value #537
FeedParserBolt: metadata added by parse filters not passed forward in topology #541
Use UTF-8 for input encoding of seeds (FileSpout) #542
Default URL filter: exclude localhost and private address spaces #543
URLStreamGrouping returns the taskIDs and not their index #547

WARC

Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520

SOLR

Schema for status index needs date type for nextFetchDate #544
SOLR indexer: use field type text for content field #545

Elasticsearch

AggregationSpout fails with default value of es.status.bucket.field == _routing #521
Move to Elasticsearch RESTAPi #539

We recommend all users to move to this version as it fixes several bugs (#541, #547) and adds some great new features. In particular, the use of the REST API for Elasticsearch, which makes the module future-proof but also easier to configure, but also #535 and #543.

As usual, thanks to all contributors and users. Happy crawling!

Monday, 31 October 2016

What's new in StormCrawler 1.2

StormCrawler 1.2 has been released today after a busy and exciting month, the highlight of which was certainly the announcement by CommonCrawl of the news dataset, which is powered by StormCrawler. This helped raise the profile of the project and also brought various improvements to the WARC and Elasticsearch modules (see below). Another great news was that my talk got accepted for ApacheCon BigData next month in Seville which prompted a Q&A interview on Linux.com.

Back to the content of the release. There have been many improvements on various levels, the main one being that the WARC module was moved to the main repository [#313]. It got many bugfixes and improvements since used by CommonCrawl and is now stable enough to join the other external modules.

We recommend all users to upgrade their configuration to the version 1.2 of StormCrawler.

Apart from minor bug fixes, the main changes in this new version are :

Core

Removes StatusStreamBolt [#341]

New Parse Filters :

MD5 signature [#354]
DomainParseFilter [#356]

URL Filters

URL Normalisation - remove parameters where the value is a 32-bit hash [#363]
Filtering : treat path parameters as query parameters [#366]
BasicURLFilter to remove URLs based on path repetition and max length [#368]

Add metadata.discoveryDate field to enable tracking discovery rate [#360]

Add super class for bolts using the status stream [#353]

JSoup Handle redirections via meta tag [#350]

Tika

Upgraded to Tika 1.13 [#285]

Combine JSoupParser with Tika [#357]

Tika parser can now parse embedded documents [#358]

Elasticsearch

Elasticsearch upgraded to 2.4.1

Metadata keys with multiple values not indexed correctly in ES [#345]

Refactoring into AbstractSpout for ES [#348]

Status index - fields stored unnecessarily [#351]

Cache URLs post ack/fail [#349]

The latter is a substantial change to the way the Elasticsearch spouts work. All 3 flavours of spouts hold a cache of the URLs being processed and use it to make sure that any URLs returned by a query are not added twice. This worked OK but did not cater for situations where a URL was towards the bottom of the buffer and acked/failed not long before the buffer was refilled from ES. In such cases, the changes to the status index hadn't had the time to be committed to the underlying index and as a result, the same URL was returned in the next query. This resulted in 10 to 15% of URLs being unnecessarily re-fetched in a short delay. What #349 does is that after ack/failing, URLs are kept in the cache for an extra N-seconds, to give time for the changed to be reflected in the search results (this is of course configurable via es.status.ttl.purgatory).

Coming next?

The releases seem to come more and more frequently. It is not sure yet what the next one will have in store but I am sure the discussions at ApacheCon as well as constant stream of new users will provide new functionalities and bugfixes.

In the meantime and as usual, thanks to all contributors and users and happy crawling!

Friday, 10 June 2016

What's new in StormCrawler 0.10?

The version 0.10 of Storm-Crawler has just been released. It contains many improvements and bugfixes and we recommend all existing users to upgrade to it.

Here are the main changes :

Core

Apart from the usual dependency upgrades (Apache Tika 1.12, Jsoup 1.8.3) and various bugfixes (notably #280 and 293), we completely removed the configuration files (parse and URL filters) from the core module (#227) as these should be specified by the user or provided by the archetype. This fixes an old issue we were having with user files being possibly overwritten with the ones provided by the core jar when generating the big jar.

We also added a mechanism to allow custom scheduling of the URLs based on metadata (#283). This applies to fetched documents only for the time being and can be used to revisit some pages more frequently depending on their nature, for instance news feeds (see below).

It is now possible to configure a list of metadata to persist (metadata.persist) into the status storage but not transfer to the outlinks of a document (#293). This is useful for the _redirTo values that we now generate to track the target of redirections (#96). Until now this would have been passed on to the outlinks, which would have been wrong.

The JSoupParser now has the option of passing on documents which it can't handle to the default stream so that another bolt can try to deal with them (#266). This can be useful e.g. to chain the JSoup parser - which generates a good DOM for XPath extraction, with the Tika one which gives rubbish DOM but can handle all sorts of file formats. JSoupParser also generates more complete DOMs, including inline Javascript and css nodes (#219).

New resources

LinkParseFilter takes Xpath expressions in the parsefilter configuration and allows to add the matching elements to the outlinks.

FeedParserBolt uses the ROME library to process news feeds and generate the outlinks accordingly.

There is a separate repository for resources to generate WARC files. These might be moved to the storm-crawler repository later on.

Elasticsearch

There has been loads of work done in this module. The main thing is that we upgraded the code to the current version 2.3.1 of Elasticsearch (#275), which should provide better performance and new functionalities. The Kibana dashboards have also been improved and can display generic metrics for Apache Storm (receive queues, memory heaps), top hosts/domains on status dashboard (see below) and total bytes fetched on the crawl metrics dashboard.

Kibana dashboard showing the status counts and top hostnames

What's next?

There is already a branch to move the code to Storm 1.x and some related improvements (#278). We will keep improving the existing components, in particular add sampling to the AggregationSpout for Elasticsearch.

Thanks to all users and contributors who helped with this release. Remember that you can follow the project on Twitter @stormcrawlerapi.

Friday, 5 June 2015

What's new in Storm-Crawler 0.5

We've just released the version 0.5 of Storm-Crawler, just over three months after the previous one. As you can read below, we've been pretty busy! The project got some great contributions from new users and is seeing an increase in adoption, which is very encouraging.

Metadata and Outlinks

One of the main improvements provided in the new release is the introduction of a Metadata object which replaces the Map<String,String[]> that were used everywhere in our code as well as the KeyValues utility class which manipulated such Maps. This makes the code a lot simpler and more elegant.

A new MetadataTransfer class has been added to (a) determine what metadata should be kept e.g. when persisting the information about a URL in a StatusUpdaterBolt but also to (b) determine what metadata should be transferred from the source document to its outlinks. This is a very useful feature that gets used quite often in practice.

Speaking of outlinks, they now have a proper class for representing them where the anchor and metadata for a given target URL are kept. Note that the parser bolts populate the metadata using the MetadataTransfer class described above prior to passing them to the ParseFilters. This means that a given ParseFilter can modify the outlinks for a given page or create completely new ones.

JSoupParserBolt

We got a present from our committer Gui whose company have kindly donated the a parsing bolt based on JSoup. This is now the one we use by default, the one based on Tika has been moved to the external part of the code. If you are crawling non-HTML pages then you should be using the Tika-based parser, otherwise the JSoup one is a lot lighter (both in code and dependencies) and works better when extracting data with XPath.

Abstract classes for persistence

We also added many useful resources for writing recursive crawlers, in addition to the status stream that came with the previous release. These can be found in the com.digitalpebble.storm.crawler.persistence package. In particular, we added a new AbstractStatusUpdaterBolt class. As the name suggests, this is meant to be extended to store the tuples coming from the status stream into some sort of storage (e.g. Elasticsearch, SOLR, Cassandra, HBase etc...). The abstract class keeps an internal cache for newly discovered URLs so that the same URL does not get updated more than once in the backend. Obviously this cache would not outlive the bolt if it died so this should be seen merely as an optimization and not a 100% reliable filter. The abstract class then calls a Scheduler, which is a pluggable mechanism to define when a given URL should be fetched next based on its metadata and status. The default scheduler simply relies on the configuration set by the user.

We also added a new AbstractIndexerBolt class, to simplify writing indexing bolts by allowing the users to specify what metadata to index via the configuration. This greatly simplifies writing an IndexerBolt.

Elasticsearch

These new classes above have been used for our Elasticsearch bolts and spout. We now have :

IndexerBolt (extending AbstractIndexerBolt) to index the content of the documents
StatusUpdaterBolt (extending AbstractStatusUpdaterBolt) to persist the status and metadata for URLs

ElasticSearchSpout to read from the status index

These 3 components allow to build a recursive crawler with Elasticsearch. We also added an example topology to illustrate how to do this as well as an init script which defines the schemas of the indices.

As a bonus we wrote a MetricsConsumer which can be plugged into the Storm metrics mechanism so that they get indexed in Elasticsearch. This would be typically used by Kibana as a way of monitoring the performance of the crawler with the metrics generated by the spouts and bolts e.g. bytes per second, pages fetched etc... I had suggested it to the elasticsearch-hadoop community but it hasn't attracted much interest so far.

We will probably provide a schema file for Kibana so that users can load a standard dashboard for displaying the metrics. We just need to wait for the next release of Kibana which will contain #1552.

Miscellaneous and next steps

We've replaced the old http protocol implementation we'd borrowed from Nutch with a brand new one based on Apache HttpClient. Less code to maintain and it is also more robust, particularly on https pages.

Apart from that, we improved our WIKI pages, upgraded some dependencies (Tika to 1.8, ES to 1.5.1, Storm to 0.9.4), added some resources (e.g. MaxDepthFilter), removed some deprecated ones (#126) and fixed numerous bugs.

As I said, we've been pretty busy and it looks like this is set to continue with the 0.6 release. It will probably contain #117 as well as resources for Apache SOLR.

Thanks to everyone who contributed to this release in any way.

Friday, 28 November 2014

Generating a test corpus for Apache Tika from CommonCrawl : Behemoth to the rescue!

It's been a while since I last blogged, in particular about Behemoth. For those who don't know about it, Behemoth is an open source project under Apache license which helps with large scale processing of documents by providing wrappers for various libraries (Tika, UIMA, GATE etc...), a common document representation used by these wrappers and some utility classes for manipulating the dataset generated by it. Behemoth runs on Hadoop and has been used in various projects over the years. I have started working on an equivalent for Apache Spark called Azazello (to continue with the same litterary reference) but it is still early days.

I have been using Behemoth in the last couple of days to help with TIKA-1302. What we are trying to do there is to have as large a test dataset as possible for Tika. We thought it would be interesting to use data from Common Crawl, in order to get (1) loads of it (2) things seen in the wild (3) various formats.

Behemoth steps

Luckily Behemoth can process WARC files such as the ones generated by Common Crawl with its IO module. Assuming you have cloned the source code of Behemoth, compiled it with Maven and have Hadoop installed all you need to do is call :

hadoop jar io/target/behemoth-io-*-SNAPSHOT-job.jar com.digitalpebble.behemoth.io.warc.WARCConverterJob -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_KEY -D document.filter.mimetype.keep=.+[^html] s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00099-ip-10-60-113-184.ec2.internal.warc.gz behemothCorpus

Please note the document.filter.mimetype.keep=.+[^html] parameter, this allows us to filter the input documents and keep only the ones that do not have html in their mimetype (as returned by the web servers).

The command above will generate a Hadoop sequence file containing serialized BehemothDocuments. The reader command can be used to have a peek at the content of the corpus e.g.

./behemoth reader -i behemothCorpus -m | more

url: http://0.static.wix.com/dicons/7ffb03_2f63cf23ec1107e4ed9824f6c98e5847.wix_doc_ico

contentType: image/jpeg

metadata:

Date: Tue, 21 Oct 2014 08:46:31 GMT

ETag: "6afff623058a88cb23a5b18c934ee8fd19192"

Server: s23.tam

Content-Type: image/jpeg

Connection: close

Content-Length: 19192

Cache-Control: max-age=604800

X-Seen-By: s23.tam_pp

IP: 207.36.47.4

[...]

The option -m displays the metadata, we could also display the binary content if we wanted to. See WIKI for the options available.

The next step is to generate an archive with the content of each file, for which we have the generic exporter command :

./behemoth exporter -i $segName -n URL -o file:///mnt/$segName -b

This gives us a a number of archives with the content of each document put in a separate file, with the URL used for its name. We can then push the resulting archives to the machine used for testing Tika.

Scaling with Amazon EMR

The commands above will work fine even on a laptop but since we are interested in processing a substantial amount of data we need a real Hadoop cluster.

I started a smallish 5 nodes Hadoop cluster with EMR, SSHed to the master, git cloned Behemoth, compiled it and pushed the segment URLs from the latest release of Common Crawl into a SQS queue then wrote a small script which pulls the segment URLs from the queue one by one, calls the WARCConverterJob then the exporter before pushing the archives to the machine used for testing Tika. The latter step is a bit of a bottleneck as it writes to the local filesystem on the master node.

On a typical segment (like s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/segments/1413507444312.8/) we filtered 30,369,012 and kept 431,546 documents. The top mimetypes look like this :

166208 contentType: image/jpeg

63097 contentType: application/pdf

58531 contentType: text/plain

38497 contentType: image/png

28906 contentType: text/calendar

10162 contentType: image/gif

7005 contentType: audio/x-wav

6604 contentType: application/json

3136 contentType: text/HTML

2932 contentType: unknown/unknown

2799 contentType: video/x-ms-asf

2609 contentType: image/jpg

1868 contentType: application/zip

1798 contentType: application/msword

The regular expression we used to filter the html documents did not take the uppercase variants into account : nevermind, it still removed most of them.

What next?

One alternative to pushing the archives to an external server would be to run the tests with Behemoth, since it has an existing wrapper for Tika. This would make the tests completely scalable and we'd also be able to use the extra information available in the BehemothDocuments such as the mime-type returned by the servers.

We'll see how this dataset gets used in TIKA-1302. There are many ways in which Behemoth can be used and it has quite a few modules available. The aim of this blog was to show how easy it is to process data on a large scale with it, with or without the CommonCrawl dataset.

By the way CommonCrawl is a great resource, please support it by donating if you can (http://commoncrawl.org/donate/).