DigitalPebble's Blog: storm


Thursday 18 October 2018

What's new in StormCrawler 1.11

I've just released StormCrawler 1.11. Here are the main changes, some of which require modifications to your configuration.

Users should upgrade to this version as it fixes several bugs and adds a lot of new functionality.

Dependency upgrades

Tika 1.19.1 (#606)
Elasticsearch 6.4.1 (#607)
SOLR 7.5 (#624)
OKHttp 3.11.0

Core

/bugfix/ FetcherBolts original metadata overwrites metadata returned by protocol (#636)
Override Globally Configured Accepts and Accepts-Language Headers Per-URL (#634)
Support for cookies in okhttp implementation (#632)
AbstractHttpProtocol uses StringTabScheme to parse input into URL and Metadata (#631)
Improve MimeType detection for interpreted server-side languages (#630)
/bugfix/ Custom intervals in Scheduler can't contain dots (#616)
OKHTTP protocol trust all SSL certificates (#615)
HTTPClient protocol setDefaultMaxPerRoute based on max threads per queue (#594)
Fetcher Added byteLength to Metadata (#599)
URLFilters + ParseFilters refactoring (#593)
HTTPClient Add simple basic auth system (#589)

WARC

/bugfix/ WARCHdfsBolt writes zero byte files (#596)

SOLR

SOLR StatusUpdater use short status name (#627)
SOLRSpout log queries, time and number of results (#623)
SOLR spout - reuse nextFetchDate (#622)
Move reset.fetchdate.after to AbstractQueryingSpout (#628)
Abstract functionalities of spout implementations (#617) - see below

SQL

MetricsConsumer (#612)
Batch PreparedStatements in SQL status updater bolt, fixes (#610)
SQLSpout group by hostname and get top N results (#609)
Harmonise param names for SQL (#619)
Move reset.fetchdate.after to AbstractQueryingSpout (#628)
Abstract functionalities of spout implementations (#617) - see below

Elasticsearch

/bugfix/ NPE in AggregationSpout when there is not any status index created (#597)
/bugfix/ NPE in CollapsingSpout (#595)
Added ability to implement custom indexes names based on metadata information (#591)
StatusMetricsBolt - Added check for avoid NPE when interacting with multi search response (#598)
Change default value of es.status.reset.fetchdate.after (#590)
Log error if elastic search reports an unexpected problem (#575)
ES Wrapper for URLFilters implementing JSONResource (#588)
Move reset.fetchdate.after to AbstractQueryingSpout (#628)
Abstract functionalities of spout implementations (#617) - see below

As you've probably noticed, #617 affects ES, SOLR as well as SQL. The idea behind it is that the spouts in these modules have a lot in common, as they all query a backend for URLs to fetch. We moved some of the functionalities to a brand new class, AbstractQueryingSpout, which greatly reduces the amount of code. The handling of the URL caching, TTL for the purgatory and min delay between queries is now done in that class. As a result, the spout implementations have less to do and can focus on the specifics of getting the data from their respective backends. A nice side effect is that the SQL and SOLR spouts now benefit from some of the functionalities which were up to now only available in ES.
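To make the shared bookkeeping more concrete, here is a minimal sketch of the two ideas mentioned above: a minimum delay between queries and a purgatory with a TTL. It is not the code of AbstractQueryingSpout. The class and method names are hypothetical, both durations are assumed to be in milliseconds, and the purgatory is simplified to "do not re-emit a URL until its TTL has expired".

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified sketch of bookkeeping shared by querying spouts (hypothetical, not the StormCrawler source). */
class QueryingSpoutSketch {
    private final long minDelayQueriesMs; // plays the role of spout.min.delay.queries (unit assumed: ms)
    private final long ttlPurgatoryMs;    // plays the role of spout.ttl.purgatory (unit assumed: ms)
    // URL -> time at which it was last handed to the topology
    private final Map<String, Long> purgatory = new HashMap<>();
    private long lastQueryTime = -1;

    QueryingSpoutSketch(long minDelayQueriesMs, long ttlPurgatoryMs) {
        this.minDelayQueriesMs = minDelayQueriesMs;
        this.ttlPurgatoryMs = ttlPurgatoryMs;
    }

    /** True if enough time has passed since the previous query to the backend. */
    boolean canQuery(long now) {
        return lastQueryTime < 0 || now - lastQueryTime >= minDelayQueriesMs;
    }

    /** Records that the backend has just been queried. */
    void markQueried(long now) {
        lastQueryTime = now;
    }

    /** True if the URL can be emitted, false if it is still held in the purgatory. */
    boolean offer(String url, long now) {
        // drop the entries whose TTL has expired
        purgatory.values().removeIf(emittedAt -> now - emittedAt >= ttlPurgatoryMs);
        if (purgatory.containsKey(url)) {
            return false;
        }
        purgatory.put(url, now);
        return true;
    }

    public static void main(String[] args) {
        QueryingSpoutSketch spout = new QueryingSpoutSketch(2000, 30000);
        System.out.println(spout.offer("http://example.com/", 0));     // not seen before
        System.out.println(spout.offer("http://example.com/", 10000)); // within the TTL
        System.out.println(spout.offer("http://example.com/", 30000)); // TTL expired
    }
}
```

A backend-specific spout would then only have to build its query and feed the results through something like offer(), which is roughly the division of labour described above.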

You will need to update your configuration to replace the elements which were specific to ES with the generic ones, i.e. spout.reset.fetchdate.after, spout.ttl.purgatory and spout.min.delay.queries. These are also used by SOLR and SQL.

Please note that these changes also impact some of the metrics names.

Coming next...

Storm 2.0.0 should be released soon, which is very exciting! There's a branch of StormCrawler which anticipates some of the changes, even though it hasn't been tested much yet. Give it a try if you want to be on the cutting edge!

I expect the SOLR and SQL backends to get further improvements and progressively catch up with our Elasticsearch resources.

Finally, our Bristol workshop next month is now full but there is one in Vilnius on 27/11. I'll also give a talk there the following day. If you are around, come and say hi and get yourself a StormCrawler sticker.

As usual, thanks to all contributors and users. Happy crawling!

Thursday 14 June 2018

What's new in StormCrawler 1.10

StormCrawler 1.9 is only a couple of weeks old but the new functionalities added since justify a new release.

Dependency upgrades

Apache Storm 1.2.2 (#583)
Crawler-Commons 0.10 (#580)
Elasticsearch 6.3.0 (#587)

Archetype

parsefilters: added CommaSeparatedToMultivaluedMetadata to split parse.keywords
bugfix: java topology in archetype does not use FeedParserBolt, fixes #551
bugfix: archetype - move SC dependency to first place to avoid STORM-2428, fixes #559

Elasticsearch

IndexerBolt set pipeline via config (#584)
Wrapper for loading JSON-based ParseFilters from ES (#569) - see below

Core

SimpleFetcherBolt to send URLs back to its own queue if time to wait above threshold (#582)
ParseFilter to tag a document based on pattern matching on its URL (#577)
New URL filter implementation based on JSON file and organised per hostname or domain #578

Let's have a closer look at some of the points above.

The CollectionTagger is a ParseFilter which provides a functionality similar to Collections in Google Search Appliance, namely the ability to add a key value in the metadata based on the URL of a document matching one or more regular expressions. The rules are expressed in a JSON file and look like

{
  "collections": [
    {
      "name": "stormcrawler",
      "includePatterns": ["http://stormcrawler.net/.+"]
    },
    {
      "name": "crawler",
      "includePatterns": [".+crawler.+", ".+nutch.+"],
      "excludePatterns": [".+baby.+", ".+spider.+"]
    }
  ]
}

Please note that the format is different from what GSA does but it can achieve the same thing.
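The matching logic behind such rules can be sketched as follows. This is not the CollectionTagger source: the class names are hypothetical, the rules are hardcoded instead of being read from JSON, and the semantics are assumed to be "the whole URL matches at least one include pattern and no exclude pattern".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of include/exclude matching; not the CollectionTagger source. */
class CollectionTaggerSketch {

    static final class Rule {
        final String name;
        final List<Pattern> includes = new ArrayList<>();
        final List<Pattern> excludes = new ArrayList<>();

        Rule(String name, List<String> includePatterns, List<String> excludePatterns) {
            this.name = name;
            for (String p : includePatterns) includes.add(Pattern.compile(p));
            for (String p : excludePatterns) excludes.add(Pattern.compile(p));
        }

        /** Assumed semantics: any exclude match rejects, otherwise one include match is enough. */
        boolean matches(String url) {
            for (Pattern p : excludes) {
                if (p.matcher(url).matches()) return false;
            }
            for (Pattern p : includes) {
                if (p.matcher(url).matches()) return true;
            }
            return false;
        }
    }

    /** The two collections from the JSON example, written out by hand. */
    static List<Rule> exampleRules() {
        return Arrays.asList(
            new Rule("stormcrawler",
                Arrays.asList("http://stormcrawler.net/.+"),
                Collections.<String>emptyList()),
            new Rule("crawler",
                Arrays.asList(".+crawler.+", ".+nutch.+"),
                Arrays.asList(".+baby.+", ".+spider.+")));
    }

    /** Returns the names of all the collections a URL belongs to. */
    static List<String> tag(String url, List<Rule> rules) {
        List<String> names = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.matches(url)) names.add(rule.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Rule> rules = exampleRules();
        System.out.println(tag("http://stormcrawler.net/faq/", rules));
        System.out.println(tag("http://nutch.apache.org/", rules));
    }
}
```

Note that a URL can end up in several collections: a page on stormcrawler.net matches the first rule but also the ".+crawler.+" pattern of the second one.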

So far, nothing revolutionary: the resource file gets loaded from the uber-jar, just like any other resource. However, what we introduced at the same time is the interface JSONResource, which CollectionTagger implements. This interface defines how implementations load a JSON file to build their resources.

Here comes the interesting bit. We added a new resource for Elasticsearch in #569 called JSONResourceWrapper. As the name suggests, this wraps any ParseFilter implementing JSONResource and delegates the filtering to it. It also allows the JSON resource to be loaded from an Elasticsearch document instead of the uber-jar and reloads it periodically. This allows you to update a resource without having to recompile the uber-jar and restart the topology.

The wrapper is configured in the usual way, i.e. via the parsefilter.json file, like so

{
  "class": "com.digitalpebble.stormcrawler.elasticsearch.parse.filter.JSONResourceWrapper",
  "name": "ESCollectionTagger",
  "params": {
    "refresh": "60",
    "delegate": {
      "class": "com.digitalpebble.stormcrawler.parse.filter.CollectionTagger",
      "params": {
        "file": "collections.json"
      }
    }
  }
}

The JSONResourceWrapper also needs to know where Elasticsearch lives. This is set via the usual configuration file:

es.config.addresses: "localhost"
es.config.index.name: "config"
es.config.doc.type: "config"
es.config.settings:
  cluster.name: "elasticsearch"

You can then push a modified version of the resources to Elasticsearch e.g. with curl

curl -XPUT 'localhost:9200/config/config/collections.json?pretty' -H 'Content-Type: application/json' -d @collections.json
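The idea of wrapping a resource and reloading it at regular intervals can be illustrated with the sketch below. It is not how JSONResourceWrapper is written: the class is hypothetical, the loader stands for whatever fetches and parses the JSON document, and the reload is done lazily on access, which is an assumption made to keep the example short.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of a resource which is reloaded once a refresh interval has elapsed. */
class ReloadingResourceSketch<T> {
    private final Supplier<T> loader; // e.g. fetches and parses the JSON document
    private final long refreshMs;     // plays the role of the "refresh" param (unit assumed: ms)
    private T current;
    private long lastLoaded;
    private boolean loaded = false;

    ReloadingResourceSketch(Supplier<T> loader, long refreshMs) {
        this.loader = loader;
        this.refreshMs = refreshMs;
    }

    /** Returns the resource, reloading it first if it is older than the refresh interval. */
    T get(long now) {
        if (!loaded || now - lastLoaded >= refreshMs) {
            try {
                current = loader.get();
                lastLoaded = now;
                loaded = true;
            } catch (RuntimeException e) {
                // keep serving the previous version if a reload fails;
                // the next call will try again
                if (!loaded) throw e;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        int[] version = {0};
        ReloadingResourceSketch<String> resource =
            new ReloadingResourceSketch<>(() -> "rules v" + (++version[0]), 60000);
        System.out.println(resource.get(0));     // initial load
        System.out.println(resource.get(1000));  // served from memory
        System.out.println(resource.get(60000)); // interval elapsed, reloaded
    }
}
```

The delegate filter never needs to know where its rules came from, which is what makes it possible to change them without rebuilding the uber-jar.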

Another resource we introduced in this release is the FastURLFilter, which also implements JSONResource (but, as there isn't a Wrapper for URLFilters yet, it can't be loaded from ES). It is similar to the existing URL filter in that it removes URLs based on regular expressions. However, it organises the rules per domain or hostname, which makes it more efficient: a URL doesn't have to be checked against all the patterns, just the ones for its domain. There is even a scope based on metadata key/values, for instance if some of your seeds were organised by collection, as well as a global scope which is tried for all URLs if nothing else matched.

The resource file looks like

[
  {
    "scope": "GLOBAL",
    "patterns": [
      "DenyPathQuery \\.jpg"
    ]
  },
  {
    "scope": "domain:stormcrawler.net",
    "patterns": [
      "AllowPath /digitalpebble/",
      "DenyPath .+"
    ]
  },
  {
    "scope": "metadata:key=value",
    "patterns": [
      "DenyPath .+"
    ]
  }
]

where the Query suffix indicates whether the pattern should be matched against the path + query element or just the path.
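Here is a rough sketch of how scoped rules like these could be evaluated. It is not the FastURLFilter source, and several points are assumptions: the scopes are tried in the order metadata, domain, then GLOBAL; within a scope the first pattern found in the path (or path + query) decides; the domain is approximated by the hostname and its parent suffixes; and a URL is kept when no rule matches. The hostname scope is left out.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of scoped Allow/Deny rules; not the FastURLFilter source. */
class FastUrlFilterSketch {

    static final class Rule {
        final boolean allow;     // AllowPath / AllowPathQuery
        final boolean withQuery; // *Query rules are matched against path + query
        final Pattern pattern;

        Rule(String line) {
            String[] parts = line.split(" ", 2);
            this.allow = parts[0].startsWith("Allow");
            this.withQuery = parts[0].endsWith("Query");
            this.pattern = Pattern.compile(parts[1]);
        }
    }

    private final Map<String, List<Rule>> scopes = new HashMap<>();

    void addScope(String scope, String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) rules.add(new Rule(line));
        scopes.put(scope, rules);
    }

    /** Returns true if the URL is kept, false if it is filtered out. */
    boolean accept(String url, Map<String, String> metadata) {
        URI uri = URI.create(url);
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        String pathQuery = uri.getRawQuery() == null ? path : path + "?" + uri.getRawQuery();

        // scopes which apply to this URL, most specific first (assumed order)
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            candidates.add("metadata:" + e.getKey() + "=" + e.getValue());
        }
        for (String d = uri.getHost(); d != null; ) {
            candidates.add("domain:" + d);
            int dot = d.indexOf('.');
            d = dot < 0 ? null : d.substring(dot + 1);
        }
        candidates.add("GLOBAL");

        // only the rules of the applicable scopes are ever looked at
        for (String scope : candidates) {
            List<Rule> rules = scopes.get(scope);
            if (rules == null) continue;
            for (Rule rule : rules) {
                String haystack = rule.withQuery ? pathQuery : path;
                if (rule.pattern.matcher(haystack).find()) return rule.allow;
            }
        }
        return true; // nothing matched: keep the URL
    }

    public static void main(String[] args) {
        FastUrlFilterSketch filter = new FastUrlFilterSketch();
        filter.addScope("GLOBAL", "DenyPathQuery \\.jpg");
        filter.addScope("domain:stormcrawler.net", "AllowPath /digitalpebble/", "DenyPath .+");
        Map<String, String> noMetadata = new HashMap<>();
        System.out.println(filter.accept("http://stormcrawler.net/digitalpebble/", noMetadata));
        System.out.println(filter.accept("http://stormcrawler.net/faq/", noMetadata));
    }
}
```

The efficiency gain comes from the map lookup: a URL on example.com never touches the patterns declared for stormcrawler.net.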

I hope you like this new release of StormCrawler and the new features it brings. I would like to thank all the users and contributors and particularly the Government of Northwest Territories in Canada who kindly donated some of the code of the CollectionTagger.

Happy Crawling!

Friday 25 May 2018

What's new in StormCrawler 1.9

Dependency upgrades

OKHttp 3.10.0 #546
JSoup 1.11.2 #552
icu4j 61.1 #556
Rometools 1.9.0 #556
HTTPClient 4.5.5 #558
Tika 1.18 #566

Core

Crawl-delay in robots.txt should optionally not shrink the configured delay #549
Optimisation: faster extraction of META tags #553
CollectionMetric synchronized access to List #555
Configurable Robots Caches #557
JSOUPParserBolt: lazy DOM conversion #563
Purge internal queues of tuples which have already reached timeout #564
Added ParseFilter to convert single valued Metadata to multi-valued ones #571
Caching of redirected robots.txt may overwrite correct robots.txt rules, fixes #573

WARC

WARCBolt to handle incorrect URIs gracefully #560
WARCRecordFormat use ByteBuffer instead of ByteArrayOutputStream #561

Archetype

Uses flux-core 1.2.1 #559
Added FeedParser to archetype topology #551
Added .kml and .wmv to url filters

SOLR

MetricsConsumer handles recursive values #554

Elasticsearch

MetricsConsumer handles recursive values #554
ES Indexer and Deletion Bolts to get index name from constructor #572

LanguageID

Added option to LanguageID to skip if metadata already set #570

As usual, we advise all users to move to this version as it fixes several bugs. Thanks to all contributors and users. Happy crawling!

Friday 23 March 2018

Grafana StormCrawler metrics v4

The Grafana dashboard for StormCrawler is a good starting point for monitoring the behaviour of your StormCrawler topology. This is typically used with Elasticsearch as a storage backend for the metrics generated by Storm but should work with any other Storm-compatible backend like Graphite or CloudWatch.

Some of the metrics are specific to the components from the Elasticsearch module (spout, status, indexer) but you can simply remove or modify them if you use e.g. SOLR (NOTE: there was a feature request in Grafana to add SOLR as a datasource but to my knowledge, this is not yet available).

The latest version (4) brings the following changes.

URLs waiting in queues

The recent 1.8 release of StormCrawler added a new metric for the FetcherBolt which allows tracking the amount of time URLs spend in the internal queues. This has been added to the "URLs waiting in queues" panel alongside the average population of the queues.

Average time spent in queues + average queues population

ES StatusUpdater

Instead of tracking the number of bulk requests sent in the last minute, we now have a panel showing the evolution over time. This information is for the ES StatusUpdaterBolt only.

ES status updater bulk requests

Acked in StatusBolt

This is a brand new panel which is not specific to Elasticsearch but operates on any component with 'status' for id and shows the number of tuples acked over time, broken down by source.

Tuples acked by StatusUpdater

In the graph above, we can see a peak early in the crawl where most of the tuples acked came from the sitemap bolt. Please note that the values are stacked in this graph. Sitemap files are typically discovered early in a crawl and generate a large number of discovered URLs; this is not the case later on when most tuples come from the HTML parser.

Robots panel

We removed the robots panel as the number of HTTP requests to robots files is shown in the "Fetcher: pages fetched" panel anyway and after the initial few minutes of a crawl, the panel simply indicated that the robots files were mostly cached.

ES Indexed

This is a new panel showing the number of documents indexed into Elasticsearch as well as the documents filtered out during the indexing.

Tuesday 20 March 2018

What's new in StormCrawler 1.8

I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:

Dependency updates

Storm 1.2.1 #531
SOLR 7.2.1 #528
Tika 1.17 #518
Elasticsearch 6.2.2 #525 and #539

Core

Add option to send only N bytes of text to indexers #476
BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522
MemorySpout to generate tuples with DISCOVERED status #529
OKHttp configure type of proxy #530
http.content.limit inconsistent default to -1 #534
Track time spent in the FetcherBolt queues #535
Increase detect.charset.maxlength default value #537
FeedParserBolt: metadata added by parse filters not passed forward in topology #541
Use UTF-8 for input encoding of seeds (FileSpout) #542
Default URL filter: exclude localhost and private address spaces #543
URLStreamGrouping returns the taskIDs and not their index #547

WARC

Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520

SOLR

Schema for status index needs date type for nextFetchDate #544
SOLR indexer: use field type text for content field #545

Elasticsearch

AggregationSpout fails with default value of es.status.bucket.field == _routing #521
Move to Elasticsearch REST API #539

We recommend that all users move to this version as it fixes several bugs (#541, #547) and adds some great new features, in particular the use of the REST API for Elasticsearch, which makes the module future-proof and easier to configure, as well as #535 and #543.

As usual, thanks to all contributors and users. Happy crawling!

Tuesday 28 November 2017

What's new in StormCrawler 1.7

Amazingly this is the 20th release of StormCrawler! Here are the main changes:

Dependencies updates

crawler-commons 0.9 #513

Core

(bugfix) ParserBolts should use outlinks from parsefilters #498
LD_JSON parsefilter #501
okhttp : store request and response headers verbatim in metadata #506
(bugfix) okhttp protocol does not store headers in metadata #507
HTTP clients should handle http.accept.language and http.accept #499
Selenium protocol follows redirections #514
RemoteDriverProtocol needs multiple instances #505
SitemapParserBolt should force mime-type based on the clue #515

Elasticsearch

ES Spout : define filter query via config #502
Upgrade to ES 6.0 #517

We recommend all users to move to this version. If you wish to remain on an older version of Elasticsearch, you can simply keep your existing version of the stormcrawler elasticsearch module while upgrading stormcrawler core.

This version improves the processing of sitemaps, via #515 and the use of the crawler-commons 0.9 where we fixed the SAX parsing and extended its coverage. We also added improvements to our okhttp-based protocol implementation. If your crawl is a wide one with potentially any sort of content then you should go for okhttp over the default httpclient one. See our comparison of protocol implementations on the WIKI.

Finally, if you want to extract semantic data represented in ld-json then you'll love #501.

As usual, thanks to all contributors and users. Happy crawling!

Monday 29 May 2017

What's new in StormCrawler 1.5

StormCrawler 1.5 has just been released! It is an important milestone with the move to Elasticsearch 5.x and the implementation of long-awaited features such as the Selenium-based protocol. The code has been improved in many ways and despite the seemingly low number of lines below, this new release is a mammoth one!

The project, in general, is in very good health, with more and more organisations using it in production, and an increased visibility, reflected by the growing number of questions on StackOverflow.

Here are the main changes in 1.5.

CORE DEPENDENCIES UPGRADES

Apache Storm 1.1.0 (#450)

CORE MODULE

HTTP Protocol: implement cookie handling (#32)
java.util.zip.ZipException: Not in GZIP format thrown on redirs with httpclient (#455)
Selenium-based protocol implementation (#144) which I described in a separate blog post
Indicate whether RobotsRules come from cache or have been fetched (#460)
Memory issues when ByteArrayBuffer gets instantiated with a large value despite maxLength being set (#462)
FetcherBolt to dump URLs being fetched to log (#464)
Override sitemapsAutoDiscovery settings per URL (#469)

Knowing whether RobotsRules come from the cache gives us more insights into the behaviour of the crawlers as we can display the ratio of cache vs live as well as pages fetched vs robots fetched.

ELASTICSEARCH

Utility class to export URL and metadata from ES index to file (#444)
Fixed sampling with aggregation spout in ES5
Upgrade to Elasticsearch 5.3 (#221 and #451)
Optimise nextFetchDate to speed up queries to Elasticsearch (#429 and #452)
Delete gone pages from index (#253)
metrics - remove filtering (#281)

One of the main changes related to Elasticsearch is the removal of ElasticsearchSpout and the introduction of CollapsingSpout, which uses the brand new FieldCollapsing in Elasticsearch. We also fixed a concurrency issue in the StatusUpdaterBolt (9fefac8), improved the efficiency of the spouts by getting them to process results in a separate thread (1b0fb42), which combined with the optimisation of nextFetchDate (see above) and the fix of the sampling in AggregationSpout, means that the Elasticsearch module is more efficient than ever.

The move to Elasticsearch 5.x was not without difficulties but the result justifies the effort. I described in a separate post the common pitfalls of upgrading an existing topology to Elasticsearch 5.

Coming next?

As usual, it is hard to guess what the next release will be made of as the project is driven by its community.

Having said that, I'd expect the Selenium-based protocol to get improved as users start to use it. It is also likely that we'll move away from Apache HttpClient library (#443). As mentioned in the previous release, we'll probably upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser.

In the meantime and as usual, thanks to all contributors and users and happy crawling!

Tuesday 4 April 2017

Video Tutorial - StormCrawler + Elasticsearch + Kibana

This tutorial explains how to configure Elasticsearch with StormCrawler.

We first bootstrap a StormCrawler project using the Maven archetype, have a look at the resources and code generated, then modify the project so that it uses Elasticsearch. We then run an injection topology and the crawl topology before setting up Kibana for monitoring the metrics and content of the status index.

(with my apologies for the quality of the sound)

Enjoy

Julien

Wednesday 29 March 2017

Full day workshop(s) on StormCrawler (+Elasticsearch and Kibana)

I will be running a full-day workshop on crawling with StormCrawler on the 24th April in Berlin. See full details on https://endoctus.com/course/web-crawling-with-stormcrawler.

Please find the program below:

In this workshop, we will explore StormCrawler, a collection of resources for building low-latency, large scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we'll put it to use straight away for a simple crawl before moving on to the deployed mode of Storm.

In the second part of the session, we will then introduce metrics and index documents with Elasticsearch and Kibana and dive into data extraction. Finally, we'll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.

This course will suit Java developers with an interest in big data, stream processing, web crawling and search. It will provide a practical introduction to both Apache Storm and Elasticsearch as well as, of course, StormCrawler and should not require advanced programming skills.

Duration : 2x3 hours

PS: Do you follow DigitalPebble or StormCrawler on Twitter? Announcements and updates are made there (as well as all sorts of interesting news of course!)

Thursday 23 March 2017

What’s new in StormCrawler 1.4

StormCrawler 1.4 has just been released! As usual, all users are advised to upgrade to this version as it fixes some bugs and contains quite a few new functionalities.

Core dependencies upgrades

Httpclient 4.5.3
Storm 1.0.3 #437

Core module

JSoupParser does not dedup outlinks properly, #375
Custom schedule based on metadata for non-success pages, #386
Adaptive fetch scheduler #407
Sitemap: increased default offset for guessing + made it configurable #409
Added URLFilterBolt + use it in ESSeedInjector #421
URLStreamGrouping #425
Better handling of redirections for HTTP robots #4372d16
HTTP Proxy over Basic Authentication #432
Improved metrics for status updater cache (hits and misses) #434
File protocol implementation #436
Added CollectionMetrics (used in ES MetricsConsumer + ES Spout, see below) #7d35acb

AWS

Added code for caching and retrieving content from AWS S3 #e16b66ef

SOLR

Basic upgrade to Solr 6.4.1
Use ConcurrentUpdateSolrClient #183

Elasticsearch

Various changes to StatusUpdaterBolt

Fixed bugs introduced in 1.3 (use of SHA ID), synchronisation issues, better logging, optimisation of docs sent and more robust handling of tuples waiting to be acked (#426). The most important change is a bug fix whereby the cache was never hit (#442) which had a large impact on performance.

Simplified README + removed bigjar profile from pom #414
Provide basic mapping for doc index #433
Simple Grafana dashboard for SC metrics, #380
Generate metrics about status counts, #389
Spouts report time taken by queries using CollectionMetric, #439 - as illustrated below

Spout query times displayed by Grafana
(illustrating the impact of SamplerAggregationSpout on a large status index )

Coming next?

As usual, it is not clear what the next release will contain but hopefully, we'll switch to Elasticsearch 5 (you can already take it from the branch es5.3) and provide resources for Selenium (see branch jBrowserDriver). As I pointed out in my previous post, getting early feedback on work in progress is a great way of contributing to the project.

We'll probably also upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser. We might move to one of the next releases of Apache Storm, where a recent contribution I made will make it possible to use Elasticsearch 5. Also, some of our StormCrawler code has been donated to Storm, which is great!

In the meantime and as usual, thanks to all contributors and users and happy crawling!

PS: I will be running a workshop in Berlin next month about StormCrawler, Storm in general and Elasticsearch

https://www.eventbrite.co.uk/e/introduction-to-web-crawling-with-stormcrawler-and-elasticsearch-tickets-30927257259

Tuesday 10 January 2017

What's new in StormCrawler 1.3

StormCrawler 1.3 has just been released! As usual, all users are advised to upgrade to this version as it fixes some bugs and contains quite a few new functionalities and improved performance.

Dependencies upgrades

Jsoup 1.10.1
Crawler-Commons 0.7
RomeTools to 1.7.0
ICU4J 58.2

Core module

Hardcoded limit to the max # connections allowed by protocol #388
LangID module #364
JsoupParserBolt can use first N bytes for charset detection (or not at all) #391
SimpleFetcherBolt uses allowRedir from super class #394 (bugfix)
URLNormalizer : Decode non-standard percent encoding prior to re-encoding
MaxDepthFilter defaults to -1, 0 removes all outlinks, can set a custom max depth per URL with max.depth. Implements #399 and #400

The latter breaks compatibility with the previous versions: 0 was used to deactivate the filtering by depth, whereas now it is used to prevent any outlinks from being processed. Please change your config to -1 if you want to deactivate the filtering.
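The new semantics can be summarised with the following sketch. It is not the MaxDepthFilter source: the class is hypothetical, and it assumes that the depth of the source page is available in its metadata under the key depth and that an outlink is kept only while that depth is below the maximum.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical sketch of the depth rules described above; not the MaxDepthFilter source. */
class MaxDepthSketch {

    /**
     * Decides whether the outlinks of a page should be kept.
     *
     * @param configuredMaxDepth value from the filter configuration (-1 by default)
     * @param sourceMetadata metadata of the page the outlinks were found on
     */
    static boolean keepOutlink(int configuredMaxDepth, Map<String, String> sourceMetadata) {
        int maxDepth = configuredMaxDepth;
        // a per-URL max.depth takes precedence over the configured value
        String custom = sourceMetadata.get("max.depth");
        if (custom != null) {
            maxDepth = Integer.parseInt(custom);
        }
        if (maxDepth < 0) {
            return true;  // -1: filtering by depth is deactivated
        }
        if (maxDepth == 0) {
            return false; // 0: no outlinks at all
        }
        // "depth" as the metadata key is an assumption of this sketch
        String depth = sourceMetadata.get("depth");
        int sourceDepth = depth == null ? 0 : Integer.parseInt(depth);
        return sourceDepth < maxDepth;
    }

    public static void main(String[] args) {
        Map<String, String> deepPage = Collections.singletonMap("depth", "50");
        System.out.println(keepOutlink(-1, deepPage)); // filtering deactivated
        System.out.println(keepOutlink(0, deepPage));  // outlinks removed
    }
}
```

In other words, a configuration which previously used 0 to mean "no limit" now behaves like the second call above, hence the need to switch to -1.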

Elasticsearch

Flux for crawl and injection topologies #372
Use min delay for all types of Spouts #370
Remove Node client #377
ESSpout deals with deep paging before building query
Topology status updater triaged by URL to hit cache
Settings done via configuration #376
Add plugin to the clients via configuration #378
Spouts: load results with a non-blocking call #371
Concurrent requests in config #382
StatusUpdaterBolt - do not add URL already in buffer for ES if status is DISCOVERED
Allow fieldNameForRoutingKey to be outside metadata and use a different key for spouts #384
Use SHA256 as doc_id #385
Separate Kibana schema for status and metrics + put all schemas in a separate folder
Improvements to ES_IndexInit
ES crawl topology uses FetcherBolt

Please note that the cluster name is now defined alongside the other settings:

  es.status.settings:
    cluster.name: "elasticsearch"

One of the benefits of #376 and #378 is that you can now use StormCrawler with Elastic Cloud protected with Shield.

We are fast approaching our 1,000th commit! Thanks to all users and contributors for their help with StormCrawler. Happy crawling!

PS: I will be running a 1-day workshop in Berlin on the 2nd of February. Announcements will be made on our Twitter account.