StormCrawler 2.10 was released yesterday and, as usual, it contains loads of improvements, dependency upgrades and bug fixes. Instead of going through each one of them, we will focus specifically on what was done for protocols.
First, every protocol implementation can now easily be tested on the command line, even FileProtocol or DelegatorProtocol thanks to #1097. For instance, 
storm local target/xxx-1.0-SNAPSHOT.jar com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol -f crawler-conf.yaml https://storm.apache.org/ -b
Using the command above in combination with debugging is particularly powerful. This can easily be done with
One of the main changes is about the Selenium module. It had been a while since it had any work done to it and its configuration was pretty obsolete. With #1093, we added some much needed unit tests and removed some incompatible configuration. #1100 added an option to deactivate tracing and we fixed the user agent substitution (#1109).
An important and incompatible change in the Selenium module is about the way the timeouts are configured. The previous mechanism was opaque and error prone. This has been replaced in #1101, the timeouts are now configured with a map
  selenium.timeouts:
    script: -1
    pageLoad: -1
    implicit: -1
with -1 preserving the Selenium default values.
The DelegatorProtocol has also been greatly improved. If you are not familiar with it, it allows you to determine which protocol implementation should be used for a URL given the metadata it has. For instance, 
  # use the normal protocol for sitemaps
  protocol.delegator.config:
   - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
     filters:
       isSitemap: "true"
   - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"will use the OKHTTP protocol for a URL if is has a key isSitemap in its metadata with a value of true. Otherwise it will use the Selenium implementation.
With #1098, we added an operator indicating whether the conditions should be treated as an AND or OR. We also added the possibility to triage based on regular expressions on the URL itself (#1110). 
You can now express more complex configurations such as  
As a result, we removed the deprecated class DelegatorRemoteDriverProtocol.
The DelegatorProtocol is of course particularly useful for avoiding sending URLs to the Selenium implementation unnecessarily, as illustrated above.
StormCrawler 2.10 contains of course other changes and dependency updates and, as usual, we recommend that you switch to it.  As we have seen today, the improvements we added to make protocol implementations easier to test and configure should be a reason to upgrade.
We would like to thank all the users and contributors to the 2.10 release.
Happy crawling!
 
 
