Thursday 14 June 2018

What's new in StormCrawler 1.10


StormCrawler 1.9 is only a couple of weeks old, but the functionality added since then justifies a new release.

Dependency upgrades

  • Apache Storm 1.2.2 (#583)
  • Crawler-Commons 0.10 (#580)
  • Elasticsearch 6.3.0 (#587)

Archetype

  • parsefilters: added CommaSeparatedToMultivaluedMetadata to split parse.keywords
  • bugfix: java topology in archetype does not use FeedParserBolt, fixes #551
  • bugfix: archetype - move SC dependency to first place to avoid STORM-2428, fixes #559

Elasticsearch

  • IndexerBolt set pipeline via config (#584)
  • Wrapper for loading JSON-based ParseFilters from ES (#569) - see below

Core

  • SimpleFetcherBolt to send URLs back to its own queue if time to wait above threshold (#582)
  • ParseFilter to tag a document based on pattern matching on its URL (#577)
  • New URL filter implementation based on JSON file and organised per hostname or domain (#578)


Let's have a closer look at some of the points above.

The CollectionTagger is a ParseFilter that provides functionality similar to Collections in the Google Search Appliance (GSA): it adds a key/value pair to the metadata of a document whose URL matches one or more regular expressions. The rules are expressed in a JSON file and look like this:

{
    "collections": [
        {
            "name": "stormcrawler",
            "includePatterns": ["http://stormcrawler.net/.+"]
        },
        {
            "name": "crawler",
            "includePatterns": [".+crawler.+", ".+nutch.+"],
            "excludePatterns": [".+baby.+", ".+spider.+"]
        }
    ]
}

Please note that the format differs from the one used by GSA, but it achieves the same thing.
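To make the include/exclude logic concrete, here is a minimal sketch in plain Java. It is not the CollectionTagger source: the class CollectionRuleSketch and its methods are hypothetical, and it assumes that a pattern must match the whole URL and that an exclude pattern always wins over an include pattern. The two collections from the JSON file above are hard-coded to avoid a JSON dependency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is not the CollectionTagger source. */
final class CollectionRuleSketch {
    final String name;
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    CollectionRuleSketch(String name, List<String> includePatterns, List<String> excludePatterns) {
        this.name = name;
        for (String p : includePatterns) includes.add(Pattern.compile(p));
        for (String p : excludePatterns) excludes.add(Pattern.compile(p));
    }

    /** Assumption: an exclude pattern always wins, and patterns must match the whole URL. */
    boolean matches(String url) {
        for (Pattern p : excludes) {
            if (p.matcher(url).matches()) return false;
        }
        for (Pattern p : includes) {
            if (p.matcher(url).matches()) return true;
        }
        return false;
    }

    /** Returns the names of all collections whose rules accept the URL. */
    static List<String> tag(String url, List<CollectionRuleSketch> rules) {
        List<String> names = new ArrayList<>();
        for (CollectionRuleSketch rule : rules) {
            if (rule.matches(url)) names.add(rule.name);
        }
        return names;
    }

    /** The two collections from the JSON file above, hard-coded for the example. */
    static List<CollectionRuleSketch> exampleRules() {
        return Arrays.asList(
            new CollectionRuleSketch("stormcrawler",
                Arrays.asList("http://stormcrawler.net/.+"),
                Collections.<String>emptyList()),
            new CollectionRuleSketch("crawler",
                Arrays.asList(".+crawler.+", ".+nutch.+"),
                Arrays.asList(".+baby.+", ".+spider.+")));
    }
}
```

Under these assumptions, calling CollectionRuleSketch.tag("http://stormcrawler.net/faq/", CollectionRuleSketch.exampleRules()) yields both collection names, because that URL also matches .+crawler.+, whereas a URL containing "baby" is rejected by the second collection.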

So far, nothing revolutionary: the resource file gets loaded from the uber-jar, just like any other resource. However, at the same time we introduced the JSONResource interface, which CollectionTagger implements. This interface defines how implementations load a JSON file to build their resources.

Here comes the interesting bit. In #569 we added a new resource for Elasticsearch called JSONResourceWrapper. As the name suggests, it wraps any ParseFilter implementing JSONResource and delegates the filtering to it. It also loads the JSON resource from an Elasticsearch document instead of the uber-jar and reloads it periodically. This allows you to update a resource without having to recompile the uber-jar and restart the topology.
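The "wrap, delegate and reload" idea can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the JSONResourceWrapper source: the names RefreshingResourceSketch, JsonLoadable and refreshIfDue are hypothetical, the JSON source is an arbitrary Supplier rather than an Elasticsearch client, and the reload is checked on demand against an injectable clock instead of running on a background timer.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical illustration of "wrap, delegate, reload"; not the JSONResourceWrapper source. */
final class RefreshingResourceSketch {
    /** Stand-in for a component that can rebuild its rules from JSON text. */
    interface JsonLoadable {
        void load(String json);
    }

    private final Supplier<String> source;   // where the JSON comes from, e.g. a remote store
    private final JsonLoadable delegate;     // the wrapped component that consumes the JSON
    private final long refreshMillis;        // minimum time between two reloads
    private final LongSupplier clock;        // injectable so the behaviour can be tested
    private boolean loaded = false;
    private long lastLoad = 0L;

    RefreshingResourceSketch(Supplier<String> source, JsonLoadable delegate,
                             long refreshMillis, LongSupplier clock) {
        this.source = source;
        this.delegate = delegate;
        this.refreshMillis = refreshMillis;
        this.clock = clock;
    }

    /** Reloads on first use or once the interval has elapsed; returns true if a reload happened. */
    boolean refreshIfDue() {
        long now = clock.getAsLong();
        if (loaded && now - lastLoad < refreshMillis) {
            return false;
        }
        delegate.load(source.get());
        lastLoad = now;
        loaded = true;
        return true;
    }
}
```

In real use the clock would typically be System::currentTimeMillis, and the supplier would fetch the latest version of the document from wherever it is stored. The point of the design is that the delegate never needs to know where its JSON came from.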

The wrapper is configured in the usual way, i.e. via the parsefilter.json file, like so:

{
    "class": "com.digitalpebble.stormcrawler.elasticsearch.parse.filter.JSONResourceWrapper",
    "name": "ESCollectionTagger",
    "params": {
        "refresh": "60",
        "delegate": {
            "class": "com.digitalpebble.stormcrawler.parse.filter.CollectionTagger",
            "params": {
                "file": "collections.json"
            }
        }
    }
}

The JSONResourceWrapper also needs to know where Elasticsearch lives. This is set via the usual configuration file:

  es.config.addresses: "localhost"
  es.config.index.name: "config"
  es.config.doc.type: "config"
  es.config.settings:
    cluster.name: "elasticsearch"

You can then push a modified version of the resources to Elasticsearch, e.g. with curl:

curl -XPUT 'localhost:9200/config/config/collections.json?pretty' -H 'Content-Type: application/json' -d @collections.json


Another resource we introduced in this release is the FastURLFilter, which also implements JSONResource. As there isn't a wrapper for URLFilters yet, it can't be loaded from ES. Like the existing URL filter, it removes URLs based on regular expressions. However, it organises the rules per domain or hostname, which makes it more efficient: a URL doesn't have to be checked against all the patterns, just the ones for its domain. There is also a scope based on metadata key/values, useful for instance if some of your seeds are organised by collection, as well as a global scope which is tried for all URLs if nothing else matched.

The resource file looks like this:

[
    {
        "scope": "GLOBAL",
        "patterns": [
            "DenyPathQuery \\.jpg"
        ]
    },
    {
        "scope": "domain:stormcrawler.net",
        "patterns": [
            "AllowPath /digitalpebble/",
            "DenyPath .+"
        ]
    },
    {
        "scope": "metadata:key=value",
        "patterns": [
            "DenyPath .+"
        ]
    }
]

where the Query suffix indicates that the pattern should be matched against the path plus the query element rather than just the path.
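The efficiency gain comes from looking up the rules by scope before running any regular expression. The sketch below shows that idea in plain Java; it is not the FastURLFilter source. The class ScopedUrlFilterSketch and its methods are hypothetical, rules are keyed by exact hostname only (domain extraction and metadata scopes are left out), and it assumes that the first rule whose pattern is found anywhere in the path decides the outcome, with the global rules tried only if no host rule matched.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical illustration of scoped URL rules; not the FastURLFilter source. */
final class ScopedUrlFilterSketch {
    /** One parsed rule line such as "DenyPathQuery \\.jpg". */
    static final class Rule {
        final boolean deny;       // Deny... vs Allow...
        final boolean withQuery;  // ...PathQuery vs ...Path
        final Pattern pattern;

        Rule(String line) {
            String[] parts = line.trim().split("\\s+", 2);
            String type = parts[0].toLowerCase(Locale.ROOT);
            this.deny = type.startsWith("deny");
            this.withQuery = type.endsWith("query");
            this.pattern = Pattern.compile(parts[1]);
        }
    }

    private final Map<String, List<Rule>> hostRules = new HashMap<>();
    private final List<Rule> globalRules = new ArrayList<>();

    void addHostRules(String host, String... lines) {
        List<Rule> rules = hostRules.computeIfAbsent(host, h -> new ArrayList<>());
        for (String line : lines) rules.add(new Rule(line));
    }

    void addGlobalRules(String... lines) {
        for (String line : lines) globalRules.add(new Rule(line));
    }

    /** Returns true if the URL should be removed. */
    boolean isDenied(String url) {
        URI uri = URI.create(url);
        // Only the rules registered for this host are evaluated first.
        Boolean verdict = firstMatch(hostRules.get(uri.getHost()), uri);
        // Assumption: the global scope is a fallback when no host rule matched.
        if (verdict == null) verdict = firstMatch(globalRules, uri);
        return verdict != null && verdict;
    }

    /** Returns the verdict of the first matching rule, or null if none matches. */
    private static Boolean firstMatch(List<Rule> rules, URI uri) {
        if (rules == null) return null;
        for (Rule rule : rules) {
            String haystack = uri.getPath();
            if (rule.withQuery && uri.getQuery() != null) {
                haystack += "?" + uri.getQuery();
            }
            if (rule.pattern.matcher(haystack).find()) return rule.deny;
        }
        return null;
    }

    /** The GLOBAL and stormcrawler.net rules from the resource file above. */
    static ScopedUrlFilterSketch example() {
        ScopedUrlFilterSketch filter = new ScopedUrlFilterSketch();
        filter.addGlobalRules("DenyPathQuery \\.jpg");
        filter.addHostRules("stormcrawler.net", "AllowPath /digitalpebble/", "DenyPath .+");
        return filter;
    }
}
```

With the example rules, ScopedUrlFilterSketch.example().isDenied(url) keeps pages under /digitalpebble/ on stormcrawler.net, drops everything else on that host, and drops .jpg URLs on any other host, including those where .jpg only appears in the query string.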

I hope you like this new release of StormCrawler and the new features it brings. I would like to thank all the users and contributors and particularly the Government of Northwest Territories in Canada who kindly donated some of the code of the CollectionTagger.

Happy Crawling!