DigitalPebble's Blog

Friday, 25 May 2018

What's new in StormCrawler 1.9

Dependency upgrades

OKHttp 3.10.0 #546
JSoup 1.11.2 #552
icu4j 61.1 #556
Rometools 1.9.0 #556
HTTPClient 4.5.5 #558
Tika 1.18 #566

Core

Crawl-delay in robots.txt should optionally not shrink the configured delay #549
Optimisation: faster extraction of META tags #553
CollectionMetric synchronized access to List #555
Configurable Robots Caches #557
JSOUPParserBolt: lazy DOM conversion #563
Purge internal queues of tuples which have already reached timeout #564
Added ParseFilter to convert single valued Metadata to multi-valued ones #571
Caching of redirected robots.txt may overwrite correct robots.txt rules, fixes #573
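
The Crawl-delay change (#549) listed above can be pictured with a minimal sketch. The class, method and flag names below are hypothetical and are not StormCrawler's API; the sketch only assumes that, when the option is enabled, a Crawl-delay found in robots.txt may lengthen the configured delay but never shorten it.

```java
// Hypothetical illustration of the idea behind #549, not StormCrawler's code.
class CrawlDelayPolicy {

    /**
     * @param configuredDelayMs delay set in the crawler configuration
     * @param robotsDelayMs     Crawl-delay from robots.txt, or -1 if absent
     * @param neverShrink       assumed flag: robots.txt may not shorten the configured delay
     */
    static long effectiveDelay(long configuredDelayMs, long robotsDelayMs, boolean neverShrink) {
        if (robotsDelayMs < 0) {
            return configuredDelayMs; // no Crawl-delay directive found
        }
        if (neverShrink) {
            // robots.txt can only make the crawler more polite
            return Math.max(configuredDelayMs, robotsDelayMs);
        }
        return robotsDelayMs; // robots.txt value wins, even if shorter
    }

    public static void main(String[] args) {
        System.out.println(effectiveDelay(1000, 200, true));
        System.out.println(effectiveDelay(1000, 200, false));
    }
}
```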

WARC

WARCBolt to handle incorrect URIs gracefully #560
WARCRecordFormat use ByteBuffer instead of ByteArrayOutputStream #561

Archetype

Uses flux-core 1.2.1 #559
Added FeedParser to archetype topology #551
Added .kml and .wmv to url filters

SOLR

MetricsConsumer handles recursive values #554

Elasticsearch

MetricsConsumer handles recursive values #554
ES Indexer and Deletion Bolts to get index name from constructor #572

LanguageID

Added option to LanguageID to skip if metadata already set #570

As usual, we advise all users to move to this version as it fixes several bugs. Thanks to all contributors and users. Happy crawling!

Friday, 23 March 2018

Grafana StormCrawler metrics v4

The Grafana dashboard for StormCrawler is a good starting point for monitoring the behaviour of your StormCrawler topology. It is typically used with Elasticsearch as the storage backend for the metrics generated by Storm, but should work with any other Storm-compatible backend such as Graphite or CloudWatch.

Some of the metrics are specific to the components from the Elasticsearch module (spout, status, indexer) but you can simply remove or modify them if you use e.g. SOLR (NOTE: there was a feature request in Grafana to add SOLR as a datasource but to my knowledge, this is not yet available).

The latest version (4) brings the following changes.

URLs waiting in queues

The recent 1.8 release of StormCrawler added a new metric to the FetcherBolt which tracks the amount of time URLs spend in its internal queues. This has been added to the "URLs waiting in queues" panel alongside the average population of the queues.

Average time spent in queues + average queues population
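
To give an idea of how such a metric can be produced, here is a minimal sketch of a queue that remembers when each URL was added and reports the average waiting time. It is an assumption about the general technique, not the FetcherBolt's actual code, and the class and method names are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: not the FetcherBolt's actual implementation.
class TimedUrlQueue {

    // A URL together with the time at which it entered the queue.
    private static final class Entry {
        final String url;
        final long enqueuedAtMs;

        Entry(String url, long enqueuedAtMs) {
            this.url = url;
            this.enqueuedAtMs = enqueuedAtMs;
        }
    }

    private final Queue<Entry> queue = new ArrayDeque<>();
    private long totalWaitMs = 0;
    private long polledCount = 0;

    void add(String url, long nowMs) {
        queue.add(new Entry(url, nowMs));
    }

    // Removes the oldest URL and records how long it waited.
    String poll(long nowMs) {
        Entry e = queue.poll();
        if (e == null) {
            return null;
        }
        totalWaitMs += nowMs - e.enqueuedAtMs;
        polledCount++;
        return e.url;
    }

    // Average time spent in the queue by the URLs polled so far.
    double averageWaitMs() {
        return polledCount == 0 ? 0.0 : (double) totalWaitMs / polledCount;
    }

    // Current population of the queue.
    int size() {
        return queue.size();
    }
}
```

Passing the clock value in explicitly keeps the sketch easy to test; a real component would more likely read System.currentTimeMillis() itself.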

ES StatusUpdater

Instead of tracking the number of bulk requests sent in the last minute, we now have a panel showing the evolution over time. This information is for the ES StatusUpdaterBolt only.

ES status updater bulk requests

Acked in StatusBolt

This is a brand new panel which is not specific to Elasticsearch but operates on any component with 'status' for id and shows the number of tuples acked over time, broken down by source.

Tuples acked by StatusUpdater

In the graph above, we can see a peak early in the crawl where most of the tuples acked came from the sitemap bolt. Please note that the values are stacked in this graph. Sitemap files are typically discovered early in a crawl and generate a large number of discovered URLs; this is not the case later on when most tuples come from the HTML parser.

Robots panel

We removed the robots panel as the number of HTTP requests to robots files is shown in the "Fetcher: pages fetched" panel anyway and after the initial few minutes of a crawl, the panel simply indicated that the robots files were mostly cached.

ES Indexed

This is a new panel showing the number of documents indexed into Elasticsearch as well as the documents filtered out during the indexing.

Tuesday, 20 March 2018

What's new in StormCrawler 1.8

I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:

Dependency updates

Storm 1.2.1 #531
SOLR 7.2.1 #528
Tika 1.17 #518
Elasticsearch 6.2.2 #525 and #539

Core

Add option to send only N bytes of text to indexers #476
BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522
MemorySpout to generate tuples with DISCOVERED status #529
OKHttp configure type of proxy #530
http.content.limit inconsistent default to -1 #534
Track time spent in the FetcherBolt queues #535
Increase detect.charset.maxlength default value #537
FeedParserBolt: metadata added by parse filters not passed forward in topology #541
Use UTF-8 for input encoding of seeds (FileSpout) #542
Default URL filter: exclude localhost and private address spaces #543
URLStreamGrouping returns the taskIDs and not their index #547
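
As an illustration of what the IDN conversion in #522 amounts to, the JDK's java.net.IDN class can turn an internationalised host name into its ASCII/Punycode form. This is only a sketch of the conversion itself; whether BasicURLNormalizer relies on this class, and how the option is named, is not stated here, and the wrapper class below is hypothetical.

```java
import java.net.IDN;

// Hypothetical wrapper around the JDK's IDN support.
class IdnHostExample {

    // Converts a (possibly internationalised) host name to ASCII/Punycode.
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // "b\u00fccher" is "bücher", written with an escape to avoid encoding issues
        System.out.println(toAsciiHost("b\u00fccher.example"));
        System.out.println(toAsciiHost("example.com"));
    }
}
```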

WARC

Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520

SOLR

Schema for status index needs date type for nextFetchDate #544
SOLR indexer: use field type text for content field #545

Elasticsearch

AggregationSpout fails with default value of es.status.bucket.field == _routing #521
Move to Elasticsearch REST API #539

We recommend that all users move to this version as it fixes several bugs (#541, #547) and adds some great new features, in particular the use of the REST API for Elasticsearch (#539), which makes the module future-proof and easier to configure, as well as #535 and #543.
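
To illustrate the kind of URLs that #543 is meant to keep out of a crawl, here is a rough sketch of a check for localhost and private IPv4 addresses. It is not the implementation of the default URL filter, and the class and method names are invented for the example; it only handles the host name "localhost" and literal IPv4 addresses.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical check, not StormCrawler's default URL filter.
class PrivateAddressFilter {

    // Returns true if the URL points at localhost or a private IPv4 address.
    static boolean isExcluded(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return true; // no usable host: reject
        }
        if (host.equalsIgnoreCase("localhost")) {
            return true;
        }
        // Only literal IPv4 addresses are inspected, so no DNS lookup is needed
        if (!host.matches("\\d{1,3}(\\.\\d{1,3}){3}")) {
            return false;
        }
        try {
            InetAddress address = InetAddress.getByName(host);
            return address.isLoopbackAddress()     // 127.0.0.0/8
                    || address.isSiteLocalAddress() // 10/8, 172.16/12, 192.168/16
                    || address.isLinkLocalAddress() // 169.254.0.0/16
                    || address.isAnyLocalAddress(); // 0.0.0.0
        } catch (UnknownHostException e) {
            return true; // not a valid address: reject
        }
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("http://192.168.1.10/admin"));
        System.out.println(isExcluded("http://example.com/"));
    }
}
```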

As usual, thanks to all contributors and users. Happy crawling!

Tuesday, 28 November 2017

What's new in StormCrawler 1.7

Amazingly this is the 20th release of StormCrawler! Here are the main changes:

Dependency updates

crawler-commons 0.9 #513

Core

(bugfix) ParserBolts should use outlinks from parsefilters #498
LD_JSON parsefilter #501
okhttp : store request and response headers verbatim in metadata #506
(bugfix) okhttp protocol does not store headers in metadata #507
HTTP clients should handle http.accept.language and http.accept #499
Selenium protocol follows redirections #514
RemoteDriverProtocol needs multiple instances #505
SitemapParserBolt should force mime-type based on the clue #515

Elasticsearch

ES Spout : define filter query via config #502
Upgrade to ES 6.0 #517

We recommend that all users move to this version. If you wish to remain on an older version of Elasticsearch, you can simply keep your existing version of the stormcrawler elasticsearch module while upgrading stormcrawler core.

This version improves the processing of sitemaps, via #515 and the use of crawler-commons 0.9, where we fixed the SAX parsing and extended its coverage. We also improved our okhttp-based protocol implementation. If your crawl is a wide one with potentially any sort of content, then you should go for okhttp over the default httpclient one. See our comparison of protocol implementations on the wiki.

Finally, if you want to extract semantic data represented in ld-json then you'll love #501.

As usual, thanks to all contributors and users. Happy crawling!

Friday, 8 September 2017

What's new in StormCrawler 1.6

Dependency updates

jsoup 1.10.3
crawler-commons 0.8

Core

Use ISO representation of time for modifiedtime in adaptivescheduler #496
Use ISO representation of time for discoveryDate and lastProcessedDate, #477
Improved Charset Detection #495
SitemapParserBolt configure use SAX or not
SitemapParserBolt generates metrics for average processing time
HTTP protocol based on OKHTTP #484
Apache Http client can use HEAD method on a per URL basis #485
ContentFilter to leave trace of the pattern that matched #480
Metadata has a new public method for getting first non-empty value from a set of keys
Added ARTICLE to patterns for content filter
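
For readers wondering what an ISO representation of time (#496, #477) looks like in practice, the sketch below converts between epoch milliseconds and ISO-8601 strings with java.time.Instant. The exact format written by StormCrawler is not specified here, so treat this as an assumption about the general idea; the class name is hypothetical.

```java
import java.time.Instant;

// Hypothetical helper showing ISO-8601 timestamps with java.time.
class IsoDateExample {

    // Epoch milliseconds to an ISO-8601 string in UTC.
    static String toIso(long epochMs) {
        return Instant.ofEpochMilli(epochMs).toString();
    }

    // ISO-8601 string back to epoch milliseconds.
    static long fromIso(String iso) {
        return Instant.parse(iso).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toIso(1504828800000L));
        System.out.println(fromIso("2017-09-08T00:00:00Z"));
    }
}
```

Unlike a raw number of milliseconds, such strings are readable by humans and sort chronologically.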

LangID

Can add more than one lang code based on configurable prob threshold. #481
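
The threshold-based selection in #481 can be sketched as follows: every language whose probability reaches the threshold is kept, so a document can end up with more than one language code. The names are hypothetical and the probabilities are made up; this is not the LanguageID module's code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of threshold-based language selection.
class LangThreshold {

    // Keeps every language code whose probability is at least minProb.
    static List<String> select(Map<String, Double> probabilities, double minProb) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> e : probabilities.entrySet()) {
            if (e.getValue() >= minProb) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up probabilities for a bilingual page
        Map<String, Double> probabilities = new LinkedHashMap<>();
        probabilities.put("en", 0.60);
        probabilities.put("fr", 0.35);
        probabilities.put("de", 0.05);
        System.out.println(select(probabilities, 0.3));
    }
}
```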

WARC

Added rotation policy based on time and filesize

ES: added es.status.reset.fetchdate.after #478
Removed Grafana resources - can be downloaded from Grafana portal

Monday, 29 May 2017

What's new in StormCrawler 1.5

StormCrawler 1.5 has just been released! It is an important milestone with the move to Elasticsearch 5.x and the implementation of long-awaited features such as the Selenium-based protocol. The code has been improved in many ways and despite the seemingly low number of lines below, this new release is a mammoth one!

The project, in general, is in very good health, with more and more organisations using it in production, and increased visibility, reflected by the growing number of questions on StackOverflow.

Here are the main changes in 1.5.

CORE DEPENDENCIES UPGRADES

Apache Storm 1.1.0 (#450)

CORE MODULE

HTTP Protocol: implement cookie handling (#32)
java.util.zip.ZipException: Not in GZIP format thrown on redirs with httpclient (#455)
Selenium-based protocol implementation (#144) which I described in a separate blog post
Indicate whether RobotsRules come from cache or have been fetched (#460)
Memory issues when ByteArrayBuffer gets instantiated with a large value despite maxLength being set (#462)
FetcherBolt to dump URLs being fetched to log (#464)
Override sitemapsAutoDiscovery settings per URL (#469)

Knowing whether RobotsRules come from the cache gives us more insight into the behaviour of the crawlers, as we can display the ratio of cached vs live robots rules as well as pages fetched vs robots fetched.

ELASTICSEARCH

Utility class to export URL and metadata from ES index to file (#444)
Fixed sampling with aggregation spout in ES5
Upgrade to Elasticsearch 5.3 (#221 and #451)
Optimise nextFetchDate to speed up queries to Elasticsearch (#429 and #452)
Delete gone pages from index (#253)
metrics - remove filtering (#281)

One of the main changes related to Elasticsearch is the removal of ElasticsearchSpout and the introduction of CollapsingSpout, which uses the brand new FieldCollapsing in Elasticsearch. We also fixed a concurrency issue in the StatusUpdaterBolt (9fefac8) and improved the efficiency of the spouts by getting them to process results in a separate thread (1b0fb42). Combined with the optimisation of nextFetchDate (see above) and the fix of the sampling in AggregationSpout, this means that the Elasticsearch module is more efficient than ever.

The move to Elasticsearch 5.x was not without difficulties but the result justifies the effort. I described in a separate post the common pitfalls of upgrading an existing topology to Elasticsearch 5.

Coming next?

As usual, it is hard to guess what the next release will contain as the project is driven by its community.

Having said that, I'd expect the Selenium-based protocol to get improved as users start to use it. It is also likely that we'll move away from the Apache HttpClient library (#443). As mentioned in the previous release, we'll probably upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser.

In the meantime and as usual, thanks to all contributors and users and happy crawling!

Monday, 15 May 2017

Avoid these common pitfalls when upgrading StormCrawler with Elasticsearch 5.x

The next (and probably imminent) release of StormCrawler will contain an update of Elasticsearch to version 5.3. This is definitely a good thing, as we want to keep up with the latest versions of Elasticsearch, but there are a few pitfalls to avoid when upgrading your existing application. Some of the changes are documented in the README but I will reiterate them here, just in case.

LOG4J dependencies

ES5 requires an upgrade in the logging dependencies of Apache Storm. You can update the dependencies in your existing Storm cluster by hand but since my patch is part of Storm 1.1.0, you should probably upgrade Storm altogether. StormCrawler 1.5 will depend on Storm 1.1.0 (but probably works with older versions as well).

Maven Shade Configuration

The pom file of your StormCrawler-based project needs modifying as well: you'll need to specify the Maven Shade Configuration and include:

<manifestEntries>
 <Change></Change>
 <Build-Date></Build-Date>
</manifestEntries>

See https://github.com/elastic/elasticsearch/issues/21627; this wasn't an issue with the previous versions of Elasticsearch.

Update es-conf.yaml

In particular, the value of es.status.bucket.field used to be _routing, which is an automatically generated field. However, this field is not available to the spouts anymore. Instead, use the same value as es.status.routing.fieldname, e.g. metadata.hostname.
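
If you build or check your configuration programmatically, the rule above can be sketched as a small helper. This is purely illustrative: the class and method are hypothetical and not part of StormCrawler, and only the two configuration keys come from es-conf.yaml.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of StormCrawler.
class EsConfCheck {

    // Returns a copy of the configuration in which a bucket field of _routing
    // is replaced by the value of es.status.routing.fieldname, if defined.
    static Map<String, Object> fixBucketField(Map<String, Object> conf) {
        Map<String, Object> fixed = new HashMap<>(conf);
        Object bucket = fixed.get("es.status.bucket.field");
        Object routing = fixed.get("es.status.routing.fieldname");
        if ("_routing".equals(bucket) && routing != null) {
            fixed.put("es.status.bucket.field", routing);
        }
        return fixed;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        conf.put("es.status.routing.fieldname", "metadata.hostname");
        conf.put("es.status.bucket.field", "_routing");
        System.out.println(fixBucketField(conf).get("es.status.bucket.field"));
    }
}
```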

Mapping

ES5 should be able to read your existing indices. However, if you create a new set of indices from scratch, make sure you use the latest version of the script.

I hope this will help you upgrade successfully. I will cover the new functionalities and improvements coming with StormCrawler 1.5 when it is released.

Happy crawling