Thursday, 16 January 2020

What's new in StormCrawler 1.16?

Happy new year!

StormCrawler 1.16 was released a couple of days ago. You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/26?closed=1

As usual, we recommend that all users upgrade to this version as it contains important fixes and performance improvements.

Dependency upgrades

  • Tika 1.23 (#771)
  • ES 7.5.0 (#770
  • jackson-databind from 2.9.9.2 to 2.9.10.1 dependency (#767)

Core

  • OKHttp configure authentication for proxies (#751)
  • Make URLBuffer configurable + AbstractURLBuffer uses URLPartitioner (#754)
  • /bugfix/ okhttp protocol: reliably mark trimmed content because of content limit (#757)
  • /!breaking!/ urlbuffer code in a separate package + 2 new implementations (#764)
  • Crawl-delay handling: allow `fetcher.max.crawl.delay` exceed 300 sec.(#768)
  • okhttp protocol: HTTP request header lacks protocol name and version (#775)
  • Locking mechanism for Metadata objects (#781)

LangID

  • /bugfix/ langID parse filter gets stuck (#758)

Elasticsearch

  • /bugfix/ Fix NullPointerException in JSONResourceWrappers  (#760)
  • ES specify field used for grouping the URLs explicitly in mapping (#761)
  • Use search after for pagination in HybridSpout (#762)
  • Filter queries in ES can be defined as lists (#765)
  • es.status.bucket.sort.field can take a list of values (#766)
  • Archetype for SC+Elasticsearch (#773)
  • ES merge seed injection into crawl topology (#778)
  • Kibana - change format of templates to ndjson (#780)
  • /bugfix/ HybridSpout get key for results when prefixed by "metadata." (#782)
  • AggregationSpout to store sortValues for the last result of each bucket (#783)
  • Import Kibana dashboards using the API (#785)
  • Include Kibana script and resources in ES archetype (#786)

One of the main improvements in 1.16 is the addition of a Maven archetype to generate a crawl topology using Elasticsearch as a backend (#773). This is done by calling

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST

The generated project also contains a script and resources to load templates into Kibana.

The topology for Elasticsearch now includes the injection of seeds from a file, which was previously in a separate topology. These changes should help beginners get started with StormCrawler.

The previous release included URLBuffers, with just one simple implementation. Two new implementations have been added in #764. The brand new PriorityURLBuffer sorts the buckets by the number of acks they got since the last sort whereas the SchedulingURLBuffer tries to guess when a queue should release a URL based on how long it took its previous URLs to be acked on average. The former has been used extensively with the HybridSpout but the latter is still experimental.

Finally, we added a soft locking mechanism to Metadata (#781)  to help trace the source of ConcurrentModificationExceptions. If you are experiencing such exceptions, calling metadata.lock() when emitting e.g.

collector.emit(StatusStreamName, tuple, new Values(url, metadata.lock(), Status.FETCHED))

will trigger an exception whenever the metadata object is modified somewhere else. You might need to call unlock() in the subsequent bolts.

This does not change the way the Metadata works but is just there to help you debug.

Hopefully, we should be able to release 2.0 in the next few months. In the meantime, happy crawling and a massive thank you to all contributors!