Sunday 6 January 2019

What's new in StormCrawler 1.13


Happy new year!

I have just released StormCrawler 1.13, which contains important bug fixes and some nice improvements.

As usual, we advise users to upgrade to this version.


Dependency upgrades

  • Xerces 2.12.0 (#672)
  • Guava 27.0.1 (#672)
  • Elasticsearch 6.5.3 (#672)
  • Jackson 2.8.11.3 (14e44)

Core

  • FileSpout uses StringTabScheme by default (#664)
  • JSoupParserBolt outlink limit per page (#670)
  • /BUGFIX/ Date format used for HTTP if-modified-since requests must follow RFC7231 (#674)
  • /BUGFIX/ DeletionBolt expects Metadata from tuples (#675)
  • Added configurable TextExtractor to JSoupParserBolt (#678)
  • !BREAKING! Core Spouts should use status stream if withDiscoveredStatus is set to true (#677)

SQL

  • SQL IndexerBolt (#608)

Archetype

  • Archetype sets StormCrawler version in a property (#668)
  • Replace ContentFilter with TextExtractor (#678)

Apart from the changes to the core spouts (#664 and #677), the main new feature is the TextExtractor (#678) for the JSoupParserBolt. Unlike the ContentParseFilter, which it replaces, it is configured from the main configuration and is not a ParseFilter, as it operates directly on the objects generated by Jsoup. The TextExtractor can restrict the text to specific elements, which avoids boilerplate and navigation elements, and it also produces far cleaner text than the ContentParseFilter, which merges some tokens. It can also define exclusion zones, which are applied either to the restricted zones or to the whole document if no such zone was defined or found. This is useful, for instance, to remove SCRIPT or STYLE elements.
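
To make the interplay between inclusion and exclusion zones more concrete, here is a rough Python sketch of the logic described above. It is not StormCrawler's Java implementation: the function names are made up for illustration, zones are matched on bare tag names only, and the input is assumed to be well-formed XHTML so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET


def _find_zones(element, include_tags, zones):
    """Collect the outermost elements whose tag is in include_tags."""
    if element.tag in include_tags:
        zones.append(element)
        return  # do not descend: nested matches are already covered
    for child in element:
        _find_zones(child, include_tags, zones)


def _collect_text(element, exclude_tags, chunks):
    """Append the text of element to chunks, skipping excluded subtrees."""
    if element.tag in exclude_tags:
        return
    if element.text:
        chunks.append(element.text)
    for child in element:
        _collect_text(child, exclude_tags, chunks)
        if child.tail:  # text following the child still belongs to element
            chunks.append(child.tail)


def extract_text(xhtml, include_tags, exclude_tags):
    """Hypothetical helper: text of the inclusion zones minus the exclusions."""
    root = ET.fromstring(xhtml)
    zones = []
    _find_zones(root, include_tags, zones)
    if not zones:
        # No inclusion zone defined or found: fall back to the whole document
        zones = [root]
    chunks = []
    for zone in zones:
        _collect_text(zone, exclude_tags, chunks)
    # Normalise whitespace between the collected chunks
    return " ".join(" ".join(chunks).split())


PAGE = (
    "<html><body><nav>Home</nav>"
    "<article><p>Hello <b>world</b></p><script>var x;</script></article>"
    "</body></html>"
)

# Restricted to ARTICLE, with SCRIPT and STYLE excluded
print(extract_text(PAGE, {"article"}, {"script", "style"}))  # -> Hello world
# No matching zone: the exclusions apply to the whole document
print(extract_text(PAGE, {"main"}, {"script", "style"}))  # -> Home Hello world
```

In the first call the navigation text disappears because it sits outside the inclusion zone; in the second call nothing matches, so only the exclusion applies.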


As usual, thanks to all contributors and users, and particularly to the Government of Northwest Territories in Canada, who kindly donated some of the code of the TextExtractor.

Happy crawling!

    Thursday 22 November 2018

    What's new in StormCrawler 1.12

    The previous release was only last month but I decided to ship this one now as it contains several bugfixes and improvements which many users would benefit from.


    As you can see below, the main changes are around protocols and sitemaps. We have used Selenium and OKHTTP a lot recently to deal with dynamic websites and the changes below definitely help for these. There is also an important bugfix for JSOUP (#653) and various other improvements.

    As usual, we advise users to upgrade to this version.

    Dependency upgrades


    • JSOUP 1.11.3 (#663)
    • Elasticsearch 6.5.0  (#661)
    • Jackson and Wiremock dependencies (#640)

    Core

    • Post JSON data with OKHTTP protocol via metadata (#641)
    • Selenium RemoteDriverProtocol triggered by K/V in metadata  (#642)
    • SeleniumProtocol NavigationFilters not reached in case of a redirection (#643)
    • Limit crawl to URLs found in sitemaps  (#645)
    • spout.reset.fetchdate.after based on time when query was set to NOW  (#648)
    • Avoid StackOverflowError when generating DocumentFragment from JSOUP (#653)
    • redirected sitemaps don't have isSitemap=true  (#660)
    • Staggered scheduling of sitemap URLs (#657)
    • Scheduling -> round to the closest second, minute or hour (#654)
    • FetcherBolt don't add discovered sitemaps if the robots rules do not allow them (#662)

    WARC

    • WARC record format: trailing zero byte causes WARC parser to fail  (#652)

    Elasticsearch

    • ES IndexerBolt track number of batch sent (#540)
    • Rename index index into docs (#649)
    • ES StatusMetricsBolt generate metrics for total number of docs (#651)

    Coming next...



    The release of Storm 2.0.0 has taken longer than expected, which is partly my fault as I reported a number of issues. These issues have now been fixed and hopefully, 2.0.0 will be out soon. As mentioned last month, there's a branch of StormCrawler which works on the Storm 2.x branch. Give it a try if you want to be on the cutting edge!

    Finally, there will be a StormCrawler workshop in Vilnius next week. I am sure tickets are still available if you fancy a last minute trip to Lithuania.

    As usual, thanks to all contributors and users. Happy crawling!

    UPDATE

    There were 2 bugs in release 1.12 which have been fixed in 1.12.1, see details on 







    Thursday 18 October 2018

    What's new in StormCrawler 1.11

    I've just released StormCrawler 1.11. Here are the main changes, some of which require modifications to your configuration.

    Users should upgrade to this version as it fixes several bugs and adds loads of functionalities.

    Dependency upgrades
    • Tika 1.19.1 (#606)
    • Elasticsearch 6.4.1 (#607)
    • SOLR 7.5 (#624)
    • OKHttp 3.11.0
    Core

    • /bugfix/ FetcherBolts original metadata overwrites metadata returned by protocol (#636)
    • Override Globally Configured Accepts and Accepts-Language Headers Per-URL  (#634)
    • Support for cookies in okhttp implementation (#632)
    • AbstractHttpProtocol uses StringTabScheme to parse input into URL and Metadata  (#631)
    • Improve MimeType detection for interpreted server-side languages (#630)
    • /bugfix/ Custom intervals in Scheduler can't contain dots  (#616)
    • OKHTTP protocol trust all SSL certificates (#615)
    • HTTPClient protocol setDefaultMaxPerRoute based on max threads per queue (#594)
    • Fetcher Added byteLength to Metadata (#599)
    • URLFilters + ParseFilters refactoring (#593)
    • HTTPClient Add simple basic auth system (#589)
    WARC

    • /bugfix/ WARCHdfsBolt writes zero byte files (#596)
    SOLR
    • SOLR StatusUpdater use short status name (#627)
    • SOLRSpout log queries, time and number of results (#623)
    • SOLR spout - reuse nextFetchDate (#622)
    • Move reset.fetchdate.after to AbstractQueryingSpout (#628)
    • Abstract functionalities of spout implementations (#617) - see below
    SQL
    • MetricsConsumer (#612)
    • Batch PreparedStatements in SQL status updater bolt, fixes (#610)
    • SQLSpout group by hostname and get top N results (#609)
    • Harmonise param names for SQL (#619)
    • Move reset.fetchdate.after to AbstractQueryingSpout (#628)
    • Abstract functionalities of spout implementations (#617) - see below

    Elasticsearch
    • /bugfix/ NPE in AggregationSpout when there is not any status index created (#597)
    • /bugfix/ NPE in CollapsingSpout (#595)
    • Added ability to implement custom indexes names based on metadata information (#591)
    • StatusMetricsBolt - Added check to avoid NPE when interacting with multi search response (#598)
    • Change default value of es.status.reset.fetchdate.after (#590)
    • Log error if elastic search reports an unexpected problem (#575)
    • ES Wrapper for URLFilters implementing JSONResource (#588)
    • Move reset.fetchdate.after to AbstractQueryingSpout (#628)
    • Abstract functionalities of spout implementations (#617) - see below
    As you've probably noticed, #617 affects ES, SOLR as well as SQL. The idea behind it is that the spouts in these modules have a lot in common, as they all query a backend for URLs to fetch. We moved some of the functionalities to a brand new class, AbstractQueryingSpout, which greatly reduces the amount of code. The handling of the URL caching, the TTL for the purgatory and the minimum delay between queries is now done in that class. As a result, the spout implementations have less to do and can focus on the specifics of getting the data from their respective backends. A nice side effect is that the SQL and SOLR spouts now benefit from some of the functionalities which were up to now only available in ES.
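
    The division of labour described above can be pictured with a small Python sketch. This is only an illustration of the idea, not the actual AbstractQueryingSpout, which is written in Java and does more: the class name, the parameter names and the single "recently emitted" map standing in for the URL cache and purgatory are all assumptions made for the example.

```python
import time
from collections import deque


class QueryingSpoutSketch:
    """Hypothetical base class: subclasses would only supply query_backend."""

    def __init__(self, query_backend, min_delay=2.0, ttl_purgatory=30.0,
                 clock=time.monotonic):
        self.query_backend = query_backend  # callable returning a list of URLs
        self.min_delay = min_delay          # min seconds between two queries
        self.ttl_purgatory = ttl_purgatory  # seconds a URL is held back
        self.clock = clock
        self.buffer = deque()               # URLs waiting to be emitted
        self.recently_emitted = {}          # URL -> time it was emitted
        self.last_query = None

    def next_url(self):
        now = self.clock()
        # Forget URLs whose time in the purgatory has expired
        self.recently_emitted = {
            url: t for url, t in self.recently_emitted.items()
            if now - t < self.ttl_purgatory
        }
        # Query the backend only when the buffer is empty and enough time
        # has passed since the previous query
        may_query = self.last_query is None or now - self.last_query >= self.min_delay
        if not self.buffer and may_query:
            self.last_query = now
            for url in self.query_backend():
                if url not in self.recently_emitted and url not in self.buffer:
                    self.buffer.append(url)
        if not self.buffer:
            return None
        url = self.buffer.popleft()
        self.recently_emitted[url] = now
        return url


# A fake backend which always returns the same two URLs
spout = QueryingSpoutSketch(lambda: ["http://a.example/", "http://b.example/"])
print(spout.next_url())
print(spout.next_url())
print(spout.next_url())  # None: too soon to query again, nothing buffered
```

    With this split, a backend-specific spout only has to implement the query itself, which is the point made above about the ES, SOLR and SQL spouts.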

    You will need to update your configuration to replace the elements which were specific to ES with the generic ones, i.e. spout.reset.fetchdate.after, spout.ttl.purgatory and spout.min.delay.queries. These are also used by SOLR and SQL.

    Please note that these changes also impact some of the metrics names.

    Coming next...

    Storm 2.0.0 should be released soon, which is very exciting! There's a branch of StormCrawler which anticipates some of the changes, even though it hasn't been tested much yet. Give it a try if you want to be on the cutting edge!

    I expect the SOLR and SQL backends to get further improvements and progressively catch up with our Elasticsearch resources.

    Finally, our Bristol workshop next month is now full but there is one in Vilnius on 27/11. I'll also give a talk there the following day. If you are around, come and say hi and get yourself a StormCrawler sticker.

    As usual, thanks to all contributors and users. Happy crawling!



    Thursday 14 June 2018

    What's new in StormCrawler 1.10


    StormCrawler 1.9 is only a couple of weeks old but the new functionalities added since justify a new release.

    Dependency upgrades

    • Apache Storm 1.2.2 (#583)
    • Crawler-Commons 0.10 (#580)
    • Elasticsearch 6.3.0 (#587)

    Archetype

    • parsefilters: added CommaSeparatedToMultivaluedMetadata to split parse.keywords
    • bugfix: java topology in archetype does not use FeedParserBolt, fixes #551
    • bugfix: archetype - move SC dependency to first place to avoid STORM-2428, fixes #559

    Elasticsearch

    • IndexerBolt set pipeline via config (#584)
    • Wrapper for loading JSON-based ParseFilters from ES (#569) - see below
    Core
    • SimpleFetcherBolt to send URLs back to its own queue if time to wait above threshold (#582)
    • ParseFilter to tag a document based on pattern matching on its URL (#577)
    • New URL filter implementation based on JSON file and organised per hostname or domain #578


    Let's have a closer look at some of the points above.

    The CollectionTagger is a ParseFilter which provides functionality similar to Collections in Google Search Appliance, namely the ability to add a key value in the metadata based on the URL of a document matching one or more regular expressions. The rules are expressed in a JSON file and look like

    {
       "collections": [{
                "name": "stormcrawler",
                "includePatterns": ["http://stormcrawler.net/.+"]
            },
            {
                "name": "crawler",
                "includePatterns": [".+crawler.+", ".+nutch.+"],
                "excludePatterns": [".+baby.+", ".+spider.+"]
            }
        ]
    }

    Please note that the format is different from what GSA does but it can achieve the same thing. 
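
    As a rough illustration of how such rules could be evaluated, here is a minimal Python sketch. It is not the CollectionTagger's code: the function name is invented, and it assumes that a pattern has to match the whole URL and that any matching exclude pattern cancels the include.

```python
import json
import re

RULES = json.loads("""
{
   "collections": [{
            "name": "stormcrawler",
            "includePatterns": ["http://stormcrawler.net/.+"]
        },
        {
            "name": "crawler",
            "includePatterns": [".+crawler.+", ".+nutch.+"],
            "excludePatterns": [".+baby.+", ".+spider.+"]
        }
    ]
}
""")


def collections_for(url, rules):
    """Return the names of the collections whose rules match the URL."""
    names = []
    for collection in rules["collections"]:
        includes = collection.get("includePatterns", [])
        excludes = collection.get("excludePatterns", [])
        included = any(re.fullmatch(p, url) for p in includes)
        excluded = any(re.fullmatch(p, url) for p in excludes)
        if included and not excluded:
            names.append(collection["name"])
    return names


print(collections_for("http://stormcrawler.net/faq/", RULES))
# -> ['stormcrawler', 'crawler']
print(collections_for("http://babycrawler.com/", RULES))
# -> []
```

    A URL can therefore end up in several collections, and the exclude patterns only affect the collection they are declared in.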

    So far, nothing revolutionary: the resource file gets loaded from the uber-jar, just like any other resource. However, what we introduced at the same time is the interface JSONResource, which CollectionTagger implements. This interface defines how implementations load a JSON file to build their resources.

    Here comes the interesting bit. We added a new resource for Elasticsearch in #569 called JSONResourceWrapper. As the name suggests, this wraps any ParseFilter implementing JSONResource and delegates the filtering to it. It also allows loading the JSON resource from an Elasticsearch document instead of the uber-jar, and reloads it periodically. This allows you to update a resource without having to recompile the uber-jar and restart the topology.

    The wrapper is configured in the usual way, i.e. via the parsefilter.json file, like so

    {
        "class": "com.digitalpebble.stormcrawler.elasticsearch.parse.filter.JSONResourceWrapper",
        "name": "ESCollectionTagger",
        "params": {
            "refresh": "60",
            "delegate": {
                "class": "com.digitalpebble.stormcrawler.parse.filter.CollectionTagger",
                "params": {
                    "file": "collections.json"
                }
            }
        }
    }

    The JSONResourceWrapper also needs to know where Elasticsearch lives. This is set via the usual configuration file:

      es.config.addresses: "localhost"
      es.config.index.name: "config"
      es.config.doc.type: "config"
      es.config.settings:
        cluster.name: "elasticsearch"

    You can then push a modified version of the resources to Elasticsearch e.g. with CURL

    curl -XPUT 'localhost:9200/config/config/collections.json?pretty' -H 'Content-Type: application/json' -d @collections.json
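
    The wrap, delegate and reload pattern described above can be sketched in a few lines of Python. Everything here is hypothetical and only meant to show the pattern: the class and method names are invented, the refresh period is assumed to be in seconds, and a plain callable stands in for the request which would fetch the document from Elasticsearch.

```python
import json
import time


class ReloadingWrapper:
    """Delegates filtering and periodically reloads the delegate's resource."""

    def __init__(self, delegate, fetch_resource, refresh_seconds=60,
                 clock=time.monotonic):
        self.delegate = delegate
        self.fetch_resource = fetch_resource  # returns the JSON as a string
        self.refresh_seconds = refresh_seconds
        self.clock = clock
        self.last_load = None

    def _reload_if_due(self):
        now = self.clock()
        if self.last_load is None or now - self.last_load >= self.refresh_seconds:
            self.delegate.load_json_resource(self.fetch_resource())
            self.last_load = now

    def filter(self, url):
        self._reload_if_due()
        return self.delegate.filter(url)


class KeywordTagger:
    """Toy delegate: tags a URL with every configured keyword it contains."""

    def __init__(self):
        self.keywords = []

    def load_json_resource(self, text):
        self.keywords = json.loads(text)["keywords"]

    def filter(self, url):
        return [k for k in self.keywords if k in url]


stored = ['{"keywords": ["storm"]}']  # stands in for the remote document
wrapper = ReloadingWrapper(KeywordTagger(), lambda: stored[0], refresh_seconds=60)
print(wrapper.filter("http://stormcrawler.net/nutch"))  # -> ['storm']
```

    Once the stored document changes, the wrapper picks up the new version on the first call made after the refresh period has elapsed, without any restart.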


    Another resource we introduced in this release is the FastURLFilter, which also implements JSONResource (but, as there isn't a Wrapper for URLFilters yet, it can't be loaded from ES). It is similar to the existing URL filter in that it allows removing URLs based on regular expressions. However, it organises the rules per domain or hostname, which makes it more efficient, as a URL doesn't have to be checked against all the patterns, just the ones for its domain. There is even a scope based on metadata key/values, for instance if some of your seeds were organised by collection, as well as a global scope which is tried for all URLs if nothing else matched.

    The resource file looks like 

    [
        {
            "scope": "GLOBAL",
            "patterns": [
                "DenyPathQuery \\.jpg"
            ]
        },
        {
            "scope": "domain:stormcrawler.net",
            "patterns": [
                "AllowPath /digitalpebble/",
                "DenyPath .+"
            ]
        },
        {
            "scope": "metadata:key=value",
            "patterns": [
                "DenyPath .+"
            ]
        }
    ]

    where the Query suffix indicates whether the pattern should be matched against the path + query element or just the path.
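
    To show how scoped rules like the ones above might be applied, here is a simplified Python sketch. It is not the FastURLFilter's actual code and rests on several assumptions: the first matching pattern within a scope decides, a pattern may match anywhere in the path, a domain scope also covers its subdomains, and a URL is kept when nothing matches. The function names are invented for the example.

```python
import re
from urllib.parse import urlsplit

RULES = [
    {"scope": "GLOBAL", "patterns": ["DenyPathQuery \\.jpg"]},
    {"scope": "domain:stormcrawler.net",
     "patterns": ["AllowPath /digitalpebble/", "DenyPath .+"]},
    {"scope": "metadata:key=value", "patterns": ["DenyPath .+"]},
]


def _decide(patterns, path, path_query):
    """Return True (allow), False (deny) or None if no pattern matched."""
    for rule in patterns:
        kind, regex = rule.split(" ", 1)
        # The Query suffix selects path + query instead of the path alone
        target = path_query if kind.endswith("Query") else path
        if re.search(regex, target):
            return kind.startswith("Allow")
    return None


def keep_url(url, metadata, rules):
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path
    path_query = path + ("?" + parts.query if parts.query else "")
    scoped, global_patterns = [], []
    for rule in rules:
        scope = rule["scope"]
        if scope == "GLOBAL":
            global_patterns.append(rule["patterns"])
        elif scope.startswith("metadata:"):
            key, _, value = scope[len("metadata:"):].partition("=")
            if metadata.get(key) == value:
                scoped.append(rule["patterns"])
        elif scope.startswith("domain:"):
            domain = scope[len("domain:"):]
            if host == domain or host.endswith("." + domain):
                scoped.append(rule["patterns"])
    # Specific scopes first; the global scope is tried last
    for patterns in scoped + global_patterns:
        decision = _decide(patterns, path, path_query)
        if decision is not None:
            return decision
    return True


print(keep_url("http://stormcrawler.net/digitalpebble/index.html", {}, RULES))  # -> True
print(keep_url("http://stormcrawler.net/faq/", {}, RULES))  # -> False
print(keep_url("http://example.com/img?file=a.jpg", {}, RULES))  # -> False
```

    Only the patterns of the scopes relevant to a URL are evaluated, which is where the efficiency gain mentioned above comes from.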

    I hope you like this new release of StormCrawler and the new features it brings. I would like to thank all the users and contributors and particularly the Government of Northwest Territories in Canada who kindly donated some of the code of the CollectionTagger.

    Happy Crawling!
     

    Friday 25 May 2018

    What's new in StormCrawler 1.9


    Dependency upgrades
    Core
    • Crawl-delay in robots.txt should optionally not shrink the configured delay #549
    • Optimisation: faster extraction of META tags #553
    • CollectionMetric synchronized access to List #555
    • Configurable Robots Caches #557
    • JSOUPParserBolt: lazy DOM conversion #563
    • Purge internal queues of tuples which have already reached timeout #564
    • Added ParseFilter to convert single valued Metadata to multi-valued ones #571
    • Caching of redirected robots.txt may overwrite correct robots.txt rules, fixes #573
    WARC
    • WARCBolt to handle incorrect URIs gracefully #560
    • WARCRecordFormat use ByteBuffer instead of ByteArrayOutputStream #561
    Archetype
    • Uses flux-core 1.2.1 #559
    • Added FeedParser to archetype topology #551
    • Added .kml and .wmv to url filters
    SOLR
    • MetricsConsumer handles recursive values #554
    Elasticsearch
    • MetricsConsumer handles recursive values #554
    • ES Indexer and Deletion Bolts to get index name from constructor #572
    LanguageID
    • Added option to LanguageID to skip if metadata already set #570
    As usual, we advise all users to move to this version as it fixes several bugs. Thanks to all contributors and users. Happy crawling!

    Friday 23 March 2018

    Grafana StormCrawler metrics v4


    The Grafana dashboard for StormCrawler is a good starting point for monitoring the behaviour of your StormCrawler topology. This is typically used with Elasticsearch as a storage backend for the metrics generated by Storm but should work with any other Storm-compatible backend like Graphite or CloudWatch.

    Some of the metrics are specific to the components from the Elasticsearch module (spout, status, indexer) but you can simply remove or modify them if you use e.g. SOLR (NOTE: there was a feature request in Grafana to add SOLR as a datasource but to my knowledge, this is not yet available).

    The latest version (4) brings the following changes.

    • URLs waiting in queues 

    The recent 1.8 release of StormCrawler added a new metric for the FetcherBolt which allows tracking the amount of time URLs spend in the internal queues. This has been added to the "URLs waiting in queues" panel alongside the average population of the queues.

    Average time spent in queues + average queues population

    • ES StatusUpdater
    Instead of tracking the number of bulk requests sent in the last minute, we now have a panel showing the evolution over time. This information is for the ES StatusUpdaterBolt only.

    ES status updater bulk requests
    • Acked in StatusBolt
    This is a brand new panel which is not specific to Elasticsearch but operates on any component with 'status' for id and shows the number of tuples acked over time, broken down by source.  

    Tuples acked by StatusUpdater
    In the graph above, we can see a peak early in the crawl where most of the tuples acked came from the sitemap bolt. Please note that the values are stacked in this graph. Sitemap files are typically discovered early in a crawl and generate a large number of discovered URLs; this is not the case later on when most tuples come from the HTML parser.
    • Robots panel
    We removed the robots panel as the number of HTTP requests to robots files is shown in the "Fetcher: pages fetched" panel anyway and after the initial few minutes of a crawl, the panel simply indicated that the robots files were mostly cached.
    • ES Indexed 
    This is a new panel showing the number of documents indexed into Elasticsearch as well as the documents filtered out during the indexing.



    Tuesday 20 March 2018

    What's new in StormCrawler 1.8

    I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:


    Dependency updates
    Core
    • Add option to send only N bytes of text to indexers #476
    • BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522
    • MemorySpout to generate tuples with DISCOVERED status #529
    • OKHttp configure type of proxy #530
    • http.content.limit inconsistent default to -1 #534
    • Track time spent in the FetcherBolt queues #535
    • Increase detect.charset.maxlength default value #537
    • FeedParserBolt: metadata added by parse filters not passed forward in topology #541
    • Use UTF-8 for input encoding of seeds (FileSpout) #542
    • Default URL filter: exclude localhost and private address spaces #543
    • URLStreamGrouping returns the taskIDs and not their index #547
    WARC
    • Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520
    SOLR
    • Schema for status index needs date type for nextFetchDate #544
    • SOLR indexer: use field type text for content field #545
    Elasticsearch
    • AggregationSpout fails with default value of es.status.bucket.field == _routing #521
    • Move to Elasticsearch REST API #539
    We recommend that all users move to this version as it fixes several bugs (#541, #547) and adds some great new features, in particular the use of the REST API for Elasticsearch, which makes the module future-proof and easier to configure, as well as #535 and #543.

    As usual, thanks to all contributors and users. Happy crawling!