DigitalPebble's Blog: October 2018

I've just released StormCrawler 1.11. Here are the main changes, some of which require modifications to your configuration.

Users should upgrade to this version, as it fixes several bugs and adds a lot of new functionality.

Dependency upgrades

Tika 1.19.1 (#606)
Elasticsearch 6.4.1 (#607)
SOLR 7.5 (#624)
OKHttp 3.11.0

Core

/bugfix/ FetcherBolts original metadata overwrites metadata returned by protocol (#636)
Override globally configured Accept and Accept-Language headers per URL (#634)
Support for cookies in okhttp implementation (#632)
AbstractHttpProtocol uses StringTabScheme to parse input into URL and Metadata (#631)
Improve MimeType detection for interpreted server-side languages (#630)
/bugfix/ Custom intervals in Scheduler can't contain dots (#616)
OKHTTP protocol trust all SSL certificates (#615)
HTTPClient protocol setDefaultMaxPerRoute based on max threads per queue (#594)
Fetcher Added byteLength to Metadata (#599)
URLFilters + ParseFilters refactoring (#593)
HTTPClient Add simple basic auth system (#589)

WARC

/bugfix/ WARCHdfsBolt writes zero byte files (#596)

SOLR

SOLR StatusUpdater use short status name (#627)
SOLRSpout log queries, time and number of results (#623)
SOLR spout - reuse nextFetchDate (#622)
Move reset.fetchdate.after to AbstractQueryingSpout (#628)
Abstract functionalities of spout implementations (#617) - see below

SQL

MetricsConsumer (#612)
Batch PreparedStatements in SQL status updater bolt, fixes (#610)
SQLSpout group by hostname and get top N results (#609)
Harmonise param names for SQL (#619)
Move reset.fetchdate.after to AbstractQueryingSpout (#628)
Abstract functionalities of spout implementations (#617) - see below

Elasticsearch

/bugfix/ NPE in AggregationSpout when there is not any status index created (#597)
/bugfix/ NPE in CollapsingSpout (#595)
Added ability to implement custom index names based on metadata information (#591)
StatusMetricsBolt - Added check to avoid NPE when interacting with multi search response (#598)
Change default value of es.status.reset.fetchdate.after (#590)
Log error if Elasticsearch reports an unexpected problem (#575)
ES Wrapper for URLFilters implementing JSONResource (#588)
Move reset.fetchdate.after to AbstractQueryingSpout (#628)
Abstract functionalities of spout implementations (#617) - see below

As you've probably noticed, #617 affects ES and SOLR as well as SQL. The idea behind it is that the spouts in these modules have a lot in common, as they all query a backend for URLs to fetch. We moved some of the functionality to a brand new class, AbstractQueryingSpout, which greatly reduces the amount of code. The URL caching, the TTL for the purgatory and the minimum delay between queries are now handled in that class. As a result, the spout implementations have less to do and can focus on the specifics of getting the data from their respective backends. A nice side effect is that the SQL and SOLR spouts now benefit from some of the functionality which was until now only available in ES.
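To make that division of labour more concrete, here is a minimal, self-contained sketch of the pattern. It is not the actual AbstractQueryingSpout code and does not depend on Storm: the class name QueryingSpoutSketch, the methods queryBackend() and next(), and the use of plain URL strings are all assumptions made for illustration. The base class owns the buffer, the purgatory with its TTL and the minimum delay between queries, while a subclass only has to say how to query its backend.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only: NOT the real AbstractQueryingSpout.
// All names below are hypothetical.
abstract class QueryingSpoutSketch {
    // URLs returned by the backend and waiting to be emitted
    private final Queue<String> buffer = new ArrayDeque<>();
    // Recently emitted URLs -> emission time; held back until the TTL expires
    private final Map<String, Long> purgatory = new HashMap<>();
    private final long ttlPurgatoryMs;
    private final long minDelayQueriesMs;
    private long lastQueryMs = -1; // -1 means "no query sent yet"

    QueryingSpoutSketch(long ttlPurgatoryMs, long minDelayQueriesMs) {
        this.ttlPurgatoryMs = ttlPurgatoryMs;
        this.minDelayQueriesMs = minDelayQueriesMs;
    }

    /** Backend-specific part (ES, SOLR, SQL...): URLs that are due for fetching. */
    protected abstract List<String> queryBackend();

    /** Returns the next URL to emit, or null if nothing is available right now. */
    String next(long nowMs) {
        // Release URLs which have been in the purgatory for longer than the TTL
        purgatory.values().removeIf(emittedAt -> nowMs - emittedAt >= ttlPurgatoryMs);

        // Query the backend only when the buffer is empty and the
        // minimum delay since the previous query has elapsed
        boolean delayElapsed = lastQueryMs < 0 || nowMs - lastQueryMs >= minDelayQueriesMs;
        if (buffer.isEmpty() && delayElapsed) {
            lastQueryMs = nowMs;
            for (String url : queryBackend()) {
                // Skip URLs emitted recently or already waiting in the buffer
                if (!purgatory.containsKey(url) && !buffer.contains(url)) {
                    buffer.add(url);
                }
            }
        }

        String url = buffer.poll();
        if (url != null) {
            purgatory.put(url, nowMs);
        }
        return url;
    }
}
```

A backend-specific spout would extend this class and implement queryBackend() only. The time is passed in explicitly so that the behaviour is easy to exercise without waiting for real delays.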

You will need to update your configuration to replace the ES-specific elements with the generic ones, i.e. spout.reset.fetchdate.after, spout.ttl.purgatory and spout.min.delay.queries. These are also used by SOLR and SQL.
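As a rough illustration of the renaming, the sketch below maps legacy keys to the generic ones in a configuration held as a Map. The helper class SpoutConfigMigration is hypothetical and not part of StormCrawler. Only es.status.reset.fetchdate.after is named in the changes above; the other two legacy key names are assumptions, so check them against your own configuration file before relying on them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of StormCrawler.
class SpoutConfigMigration {
    // Legacy ES-specific key -> generic key introduced in 1.11
    static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("es.status.reset.fetchdate.after", "spout.reset.fetchdate.after");
        // The two legacy names below are assumed, not taken from the release notes
        RENAMES.put("es.status.ttl.purgatory", "spout.ttl.purgatory");
        RENAMES.put("es.status.min.delay.queries", "spout.min.delay.queries");
    }

    /** Returns a copy of conf in which the legacy keys are replaced by the generic ones. */
    static Map<String, Object> migrate(Map<String, Object> conf) {
        Map<String, Object> out = new LinkedHashMap<>();
        // Copy everything which is not a legacy key, including generic keys already set
        for (Map.Entry<String, Object> e : conf.entrySet()) {
            if (!RENAMES.containsKey(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        // Carry legacy values over, unless the generic key was set explicitly
        for (Map.Entry<String, Object> e : conf.entrySet()) {
            String generic = RENAMES.get(e.getKey());
            if (generic != null) {
                out.putIfAbsent(generic, e.getValue());
            }
        }
        return out;
    }
}
```

Keys which are not in the rename table are left untouched, and a generic key which is already present takes precedence over its legacy equivalent.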

Please note that these changes also impact some of the metrics names.

Coming next...

Storm 2.0.0 should be released soon, which is very exciting! There's a branch of StormCrawler which anticipates some of the changes, even though it hasn't been tested much yet. Give it a try if you want to be on the cutting edge!

I expect the SOLR and SQL backends to get further improvements and progressively catch up with our Elasticsearch resources.

Finally, our Bristol workshop next month is now full but there is one in Vilnius on 27/11. I'll also give a talk there the following day. If you are around, come and say hi and get yourself a StormCrawler sticker.

As usual, thanks to all contributors and users. Happy crawling!

Thursday, 18 October 2018

What's new in StormCrawler 1.11