DigitalPebble's Blog: open source

Wednesday 5 May 2021

What's new in StormCrawler 1.18

StormCrawler 1.18 has just been released. Since the previous version dates from nearly 10 months ago, the number of changes is rather large (see below).

This version contains many bugfixes; as usual, users are advised to upgrade. One of the most noticeable new features is a module for URLFrontier (if you haven't checked it out, do so right now!); I will publish a tutorial on how to use it soon.

1.18 is also likely to be the last release based on Apache Storm 1.x; our 2.x branch will become master as soon as I have released 2.1.

Happy crawling and thanks to our sponsors, contributors and users!

Monday 20 July 2020

What's new in StormCrawler 1.17

I have just released StormCrawler 1.17. As you can see in the list below, it contains important bugfixes and improvements. For this reason, we recommend that all users upgrade to this version. However, please check the breaking changes below before applying it to an existing crawl.

Dependency upgrades

Various dependency upgrades #808
CrawlerCommons 1.1 dependency #807
Tika 1.24.1 #797
Jackson-databind #803 #793 #798

Core

Use regular expressions for custom number of threads per queue fetcher #788
/!breaking!/ Prefix protocol metadata #789
Basic authentication for OKHTTP #792
Utility to debug / test parsefilters #794
/!breaking!/ Remove deprecated methods and fields #791
AdaptiveScheduler to set last-modified time in metadata #777 #812
/bugfix/ _fetch.exception_ key should be removed from metadata if subsequent fetches are successful #813
/bugfix/ SimpleFetcherBolt maxThrottleSleepMSec not deactivated #814
/!breaking!/ Index pages with content="noindex,follow" meta tag #750
Enable extension parsing for SitemapParser #749 #815

WARC

Implement WARC spout #755 #799

Elasticsearch

/bugfix/ AggregationSpout error due to SimpleDateFormat not being thread-safe #809
/bugfix/ IndexerBolt issue causing ack failures #801
Allow ES to connect over a proxy #787

Of the breaking changes above, #789 is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add

protocol.md.prefix: ""

to the configuration. Similarly, http.skip.robots has been renamed to http.robots.file.skip.

Thanks to all contributors and users! Happy crawling!

PS: something equally exciting is coming next ;-)

Thursday 16 January 2020

What's new in StormCrawler 1.16?

Happy new year!

StormCrawler 1.16 was released a couple of days ago. You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/26?closed=1

As usual, we recommend that all users upgrade to this version as it contains important fixes and performance improvements.

Dependency upgrades

Tika 1.23 (#771)
ES 7.5.0 (#770)
jackson-databind from 2.9.9.2 to 2.9.10.1 (#767)

Core

OKHttp configure authentication for proxies (#751)
Make URLBuffer configurable + AbstractURLBuffer uses URLPartitioner (#754)
/bugfix/ okhttp protocol: reliably mark trimmed content because of content limit (#757)
/!breaking!/ urlbuffer code in a separate package + 2 new implementations (#764)
Crawl-delay handling: allow `fetcher.max.crawl.delay` to exceed 300 sec. (#768)
okhttp protocol: HTTP request header lacks protocol name and version (#775)
Locking mechanism for Metadata objects (#781)

LangID

/bugfix/ langID parse filter gets stuck (#758)

Elasticsearch

/bugfix/ Fix NullPointerException in JSONResourceWrappers (#760)
ES specify field used for grouping the URLs explicitly in mapping (#761)
Use search after for pagination in HybridSpout (#762)
Filter queries in ES can be defined as lists (#765)
es.status.bucket.sort.field can take a list of values (#766)
Archetype for SC+Elasticsearch (#773)
ES merge seed injection into crawl topology (#778)
Kibana - change format of templates to ndjson (#780)
/bugfix/ HybridSpout get key for results when prefixed by "metadata." (#782)
AggregationSpout to store sortValues for the last result of each bucket (#783)
Import Kibana dashboards using the API (#785)
Include Kibana script and resources in ES archetype (#786)

One of the main improvements in 1.16 is the addition of a Maven archetype to generate a crawl topology using Elasticsearch as a backend (#773). This is done by calling

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST

The generated project also contains a script and resources to load templates into Kibana.

The topology for Elasticsearch now includes the injection of seeds from a file, which was previously in a separate topology. These changes should help beginners get started with StormCrawler.

The previous release included URLBuffers, with just one simple implementation. Two new implementations have been added in #764. The brand new PriorityURLBuffer sorts the buckets by the number of acks they got since the last sort whereas the SchedulingURLBuffer tries to guess when a queue should release a URL based on how long it took its previous URLs to be acked on average. The former has been used extensively with the HybridSpout but the latter is still experimental.

Finally, we added a soft locking mechanism to Metadata (#781) to help trace the source of ConcurrentModificationExceptions. If you are experiencing such exceptions, calling metadata.lock() when emitting e.g.

collector.emit(StatusStreamName, tuple, new Values(url, metadata.lock(), Status.FETCHED));

will trigger an exception whenever the metadata object is modified somewhere else. You might need to call unlock() in the subsequent bolts.

This does not change the way the Metadata works but is just there to help you debug.
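
To illustrate the idea, here is a minimal, self-contained sketch of a soft lock on a metadata-like object. It is not the StormCrawler implementation: the class LockableMetadata and its methods are hypothetical and only mimic the lock() / unlock() behaviour described above, assuming that any write on a locked object should fail immediately.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata object with a soft lock;
// this is NOT the StormCrawler Metadata class.
class LockableMetadata {
    private final Map<String, String> values = new HashMap<>();
    private boolean locked = false;

    // Returns this so that the call can be chained when emitting a tuple.
    LockableMetadata lock() {
        locked = true;
        return this;
    }

    LockableMetadata unlock() {
        locked = false;
        return this;
    }

    // Any write while locked fails fast; the stack trace then points at
    // the code which modifies the object after it has been emitted.
    void setValue(String key, String value) {
        if (locked) {
            throw new IllegalStateException(
                    "Attempt to modify locked metadata, key: " + key);
        }
        values.put(key, value);
    }

    String getValue(String key) {
        return values.get(key);
    }

    public static void main(String[] args) {
        LockableMetadata md = new LockableMetadata();
        md.setValue("depth", "1");
        md.lock();
        try {
            md.setValue("depth", "2"); // rejected: the object is locked
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
        md.unlock().setValue("depth", "2"); // accepted after unlock()
        System.out.println(md.getValue("depth"));
    }
}
```

The lock is "soft" in the sense that it does not synchronise anything; it only turns a silent concurrent modification into an explicit exception at the place where the modification happens.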

Hopefully, we should be able to release 2.0 in the next few months. In the meantime, happy crawling and a massive thank you to all contributors!

Thursday 19 September 2019

What's new in StormCrawler 1.15?

StormCrawler 1.15 was released yesterday and, as usual, contains loads of improvements and bugfixes.

You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/25?closed=1

We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

Storm 1.2.3 (#743)
JSOUP 1.12.1 (#741)
ES 7.3.0 (#742)
Tika 1.22 (#726)

Core

/bugfix/ CharsetIdentification crashes on binary content (#747)
FetcherBolt skips tuples which have spent too much time in queues (#746)
Fetcher bolts generate metrics for HTTP status (#745)
improvements to URLFilterBolt (#740)
/bugfix/ FetcherBolt doesn't recover when entering maxNumberURLsInQueues (#738)
/bugfix/ RemoteDriverProtocol does not set user agent correctly (#735)
Force English Locale for SimpleDateFormat in cookie converter (#732)

LangID

LangId normalises and returns value found via extraction (#733)

Elasticsearch

Pluggable URLBuffer and Hybrid Elasticsearch spout (#752)

ES spouts control how long the search is allowed to take with timeout (#753)

Improve types used for numeric values for metrics mappings (#744)

Use sniffer for ES connections (#734)

ScrollSpout to quit logging when finished (#727)

ES spouts use nextFetchDate RangeQuery as a filter (#725)

MetricsConsumer takes an optional date format (#724)

StatusMetricsBolt returns a max of 10K results per status (#723)

Happy crawling and thanks to all contributors!

Monday 13 May 2019

What's new in StormCrawler 1.14

StormCrawler 1.14 was released yesterday and, as usual, contains loads of improvements and bugfixes.

You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/24?closed=1

This release contains a number of breaking changes, mostly related to the move to Elasticsearch 7. We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

crawler-commons 1.0 #693
okhttp 3.14.0 #692
guava 27.1 (#702)
icu4j 64.1 (#702)
httpclient 4.5.8 (#702)
Snakeyaml 1.24 (#702)
wiremock 2.22.0 (#702)
rometools 1.12.0 (#702)
Elasticsearch 7.0.0 (#708)

Core

Track how long a spout has been without any URLs in its buffer (#685)
Change ack mechanism for StatusUpdaterBolts (#689)
Robots URL filter to get instructions from cache only (#700)
Allow indexing under canonical URL if in the same domain, not just host (#703)
/bugfix/ URLs ending with a space are fetched over and over again (#704)
ParseFilter to normalise the mime-type of documents into simple values (#707)
Robot rules should check the cache in case of a redirection (#709)
/bugfix/ Fix the logic around sitemap = false (#710)
Reduce logging of exceptions in FetcherBolt (#719)

Elasticsearch

Asynchronous spouts (i.e. ES) can send queries after max delay since previous one ended (#683)
StatusUpdaterBolt to load config from non-default param names (#687)
Add a ScrollSpout to read all the documents from a shard (#688 and #690) - see in our guest post how this can be used to reindex a status index.
ES IndexerBolt : check success of batches before acking tuples (#647)
/bugfix/ URLs with content that breaks ES get refetched over and over again (#705)
/bugfix/ URLs without valid host name (and routing) stay DISCOVERED forever (#706)
/bugfix/ ESSeedInjector: no URLs injected because URL filter does not subscribe to status stream (#715)
MetricsConsumer to include topology ID in metrics (#714)

WARC

Generate WARC request records (#509)
WARC format improvements (#691)

Tika

Set mimetype whitelist for Tika Parser (#712)

*********

I will be running a workshop on StormCrawler next month at the Web Archiving Conference in Zagreb and giving a presentation jointly with Sebastian Nagel of CommonCrawl. I will come with loads of presents generously given by our friends at Elastic.

As usual, thanks to all contributors and users.

Happy crawling!

Sunday 6 January 2019

What's new in StormCrawler 1.13

Happy new year!

I have just released StormCrawler 1.13, which contains important bug fixes and some nice improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades

Tika 1.20 (#676)

Xerces 2.12.0 (#672)

Guava 27.0.1 (#672)

Elasticsearch 6.5.3 (#672)

Jackson 2.8.11.3 (14e44)

Core

FileSpout uses StringTabScheme by default (#664)

JSoupParserBolt outlink limit per page (#670)

/BUGFIX/ Date format used for HTTP if-modified-since requests must follow RFC7231 (#674)

/BUGFIX/ DeletionBolt expects Metadata from tuples (#675)

Added configurable TextExtractor to JSoupParserBolt (#678)

!BREAKING! Core Spouts should use status stream if withDiscoveredStatus is set to true (#677)

SQL

SQL IndexerBolt (#608)

Archetype

Archetype sets StormCrawler version in a property (#668)

Replace ContentFilter with TextExtractor (#678)

Apart from the changes to the core spouts (#664 and #677), the main new feature is the addition of the TextExtractor (#678) for the JSoupParserBolt. Unlike the ContentFilter, which it replaces, it is configured from the main configuration and is not a ParseFilter, as it operates directly on the objects generated by JSoup. The TextExtractor allows restricting the text to specific elements in order to avoid boilerplate and navigation elements, and it produces far cleaner text than the ContentFilter, which merges some tokens. The TextExtractor can also be used to define exclusion zones, which are applied either to the restricted zones or to the whole document if no such zones were defined or found. This is useful, for instance, to remove SCRIPT or STYLE elements.

As usual, thanks to all contributors and users, and particularly the Government of Northwest Territories in Canada who kindly donated some of the code of the TextExtractor.

Happy crawling!

Thursday 22 November 2018

What's new in StormCrawler 1.12

The previous release was only last month but I decided to ship this one now as it contains several bugfixes and improvements which many users would benefit from.

As you can see below, the main changes are around protocols and sitemaps. We have used Selenium and OKHTTP a lot recently to deal with dynamic websites and the changes below definitely help for these. There is also an important bugfix for JSOUP (#653) and various other improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades

JSOUP 1.11.3 (#663)
Elasticsearch 6.5.0 (#661)
Jackson and Wiremock dependencies (#640)

Core

Post JSON data with OKHTTP protocol via metadata (#641)
Selenium RemoteDriverProtocol triggered by K/V in metadata (#642)
SeleniumProtocol NavigationFilters not reached in case of a redirection (#643)
Limit crawl to URLs found in sitemaps (#645)
spout.reset.fetchdate.after based on time when query was set to NOW (#648)
Avoid StackOverflowError when generating DocumentFragment from JSOUP (#653)
redirected sitemaps don't have isSitemap=true (#660)
Staggered scheduling of sitemap URLs (#657)
Scheduling -> round to the closest second, minute or hour (#654)
FetcherBolt don't add discovered sitemaps if the robots rules do not allow them (#662)

WARC

WARC record format: trailing zero byte causes WARC parser to fail (#652)

Elasticsearch

ES IndexerBolt track number of batch sent (#540)
Rename the "index" index to "docs" (#649)
ES StatusMetricsBolt generate metrics for total number of docs (#651)

Coming next...

The release of Storm 2.0.0 has taken longer than expected, which is partly my fault as I reported a number of issues. These issues have now been fixed and hopefully, 2.0.0 will be out soon. As mentioned last month, there's a branch of StormCrawler which works on the Storm 2.x branch. Give it a try if you want to be on the cutting edge!

Finally, there will be a StormCrawler workshop in Vilnius next week. I am sure tickets are still available if you fancy a last minute trip to Lithuania.

As usual, thanks to all contributors and users. Happy crawling!

UPDATE

There were 2 bugs in release 1.12 which have been fixed in 1.12.1, see details on

https://github.com/DigitalPebble/storm-crawler/milestone/23?closed=1

Thursday 18 October 2018

What's new in StormCrawler 1.11

I've just released StormCrawler 1.11. Here are the main changes, some of which require modifications to your configuration.

Users should upgrade to this version as it fixes several bugs and adds loads of functionalities.

Dependency upgrades

Tika 1.19.1 (#606)
Elasticsearch 6.4.1 (#607)
SOLR 7.5 (#624)
OKHttp 3.11.0

Core

/bugfix/ FetcherBolts original metadata overwrites metadata returned by protocol (#636)
Override Globally Configured Accepts and Accepts-Language Headers Per-URL (#634)
Support for cookies in okhttp implementation (#632)
AbstractHttpProtocol uses StringTabScheme to parse input into URL and Metadata (#631)
Improve MimeType detection for interpreted server-side languages (#630)
/bugfix/ Custom intervals in Scheduler can't contain dots (#616)
OKHTTP protocol trust all SSL certificates (#615)
HTTPClient protocol setDefaultMaxPerRoute based on max threads per queue (#594)
Fetcher Added byteLength to Metadata (#599)
URLFilters + ParseFilters refactoring (#593)
HTTPClient Add simple basic auth system (#589)

WARC

/bugfix/ WARCHdfsBolt writes zero byte files (#596)

SOLR

SOLR StatusUpdater use short status name (#627)
SOLRSpout log queries, time and number of results (#623)
SOLR spout - reuse nextFetchDate (#622)
Move reset.fetchdate.after to AbstractQueryingSpout (#628)
Abstract functionalities of spout implementations (#617) - see below

SQL

MetricsConsumer (#612)
Batch PreparedStatements in SQL status updater bolt, fixes (#610)
SQLSpout group by hostname and get top N results (#609)
Harmonise param names for SQL (#619)
Move reset.fetchdate.after to AbstractQueryingSpout (#628)
Abstract functionalities of spout implementations (#617) - see below

Elasticsearch

/bugfix/ NPE in AggregationSpout when there is not any status index created (#597)
/bugfix/ NPE in CollapsingSpout (#595)
Added ability to implement custom indexes names based on metadata information (#591)
StatusMetricsBolt - Added check for avoid NPE when interacting with multi search response (#598)
Change default value of es.status.reset.fetchdate.after (#590)
Log error if elastic search reports an unexpected problem (#575)
ES Wrapper for URLFilters implementing JSONResource (#588)
Move reset.fetchdate.after to AbstractQueryingSpout (#628)
Abstract functionalities of spout implementations (#617) - see below

As you've probably noticed, #617 affects ES and SOLR as well as SQL. The idea behind it is that the spouts in these modules have a lot in common, as they all query a backend for URLs to fetch. We moved some of the functionalities to a brand new class, AbstractQueryingSpout, which greatly reduces the amount of code. The handling of the URL caching, the TTL for the purgatory and the minimum delay between queries is now done in that class. As a result, the spout implementations have less to do and can focus on the specifics of getting the data from their respective backends. A nice side effect is that the SQL and SOLR spouts now benefit from some of the functionalities which were up to now only available in ES.

You will need to update your configuration to replace the elements which were specific to ES by the generic ones i.e. spout.reset.fetchdate.after, spout.ttl.purgatory and spout.min.delay.queries. These are also used by SOLR and SQL.

Please note that these changes also impact some of the metrics names.

Coming next...

Storm 2.0.0 should be released soon, which is very exciting! There's a branch of StormCrawler which anticipates some of the changes, even though it hasn't been tested much yet. Give it a try if you want to be on the cutting edge!

I expect the SOLR and SQL backends to get further improvements and progressively catch up with our Elasticsearch resources.

Finally, our Bristol workshop next month is now full but there is one in Vilnius on 27/11. I'll also give a talk there the following day. If you are around, come and say hi and get yourself a StormCrawler sticker.

As usual, thanks to all contributors and users. Happy crawling!

Thursday 14 June 2018

What's new in StormCrawler 1.10

StormCrawler 1.9 is only a couple of weeks old but the new functionalities added since justify a new release.

Dependency upgrades

Apache Storm 1.2.2 (#583)
Crawler-Commons 0.10 (#580)
Elasticsearch 6.3.0 (#587)

Archetype

parsefilters: added CommaSeparatedToMultivaluedMetadata to split parse.keywords
bugfix: java topology in archetype does not use FeedParserBolt, fixes #551
bugfix: archetype - move SC dependency to first place to avoid STORM-2428, fixes #559

Elasticsearch

IndexerBolt set pipeline via config (#584)
Wrapper for loading JSON-based ParseFilters from ES (#569) - see below

Core

SimpleFetcherBolt to send URLs back to its own queue if time to wait above threshold (#582)
ParseFilter to tag a document based on pattern matching on its URL (#577)
New URL filter implementation based on JSON file and organised per hostname or domain #578

Let's have a closer look at some of the points above.

The CollectionTagger is a ParseFilter which provides functionality similar to Collections in the Google Search Appliance, namely the ability to add a key/value pair to the metadata when the URL of a document matches one or more regular expressions. The rules are expressed in a JSON file and look like

{
  "collections": [{
      "name": "stormcrawler",
      "includePatterns": ["http://stormcrawler.net/.+"]
    },
    {
      "name": "crawler",
      "includePatterns": [".+crawler.+", ".+nutch.+"],
      "excludePatterns": [".+baby.+", ".+spider.+"]
    }
  ]
}

Please note that the format is different from what GSA does but it can achieve the same thing.
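
To make the matching rules concrete, here is a minimal sketch of how include and exclude patterns could be evaluated against a URL. It is not the actual CollectionTagger code: the class CollectionRule is hypothetical, and the sketch assumes that a URL belongs to a collection if it fully matches at least one include pattern and none of the exclude patterns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of include / exclude matching;
// this is NOT the actual CollectionTagger implementation.
class CollectionRule {
    final String name;
    final List<Pattern> includes = new ArrayList<>();
    final List<Pattern> excludes = new ArrayList<>();

    CollectionRule(String name, List<String> includePatterns,
            List<String> excludePatterns) {
        this.name = name;
        for (String p : includePatterns) includes.add(Pattern.compile(p));
        for (String p : excludePatterns) excludes.add(Pattern.compile(p));
    }

    // Assumption: exclusions win over inclusions and patterns must match
    // the whole URL (Matcher.matches), not just a part of it.
    boolean matches(String url) {
        for (Pattern p : excludes) {
            if (p.matcher(url).matches()) return false;
        }
        for (Pattern p : includes) {
            if (p.matcher(url).matches()) return true;
        }
        return false;
    }

    // Returns the names of all the collections a URL belongs to.
    static List<String> tag(String url, List<CollectionRule> rules) {
        List<String> names = new ArrayList<>();
        for (CollectionRule r : rules) {
            if (r.matches(url)) names.add(r.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<CollectionRule> rules = Arrays.asList(
                new CollectionRule("stormcrawler",
                        Arrays.asList("http://stormcrawler.net/.+"),
                        Collections.<String>emptyList()),
                new CollectionRule("crawler",
                        Arrays.asList(".+crawler.+", ".+nutch.+"),
                        Arrays.asList(".+baby.+", ".+spider.+")));
        System.out.println(tag("http://stormcrawler.net/faq/", rules));
        System.out.println(tag("http://example.com/baby-crawler/", rules));
    }
}
```

With the rules from the JSON file above, a URL can end up in several collections at once, which is why the sketch returns a list of names rather than a single value.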

So far, nothing revolutionary, the resource file gets loaded from the uber-jar, just like any other resource. However, what we introduced at the same time is the interface JSONResource, which CollectionTagger implements. This interface defines how implementations load a JSON file to build their resources.

Here comes the interesting bit. We added a new resource for Elasticsearch in #569 called JSONResourceWrapper. As the name suggests, this wraps any ParseFilter implementing JSONResource and delegates the filtering to it. What it also does, is that it allows loading the JSON resource from an Elasticsearch document instead of the uber-jar and reloads it periodically. This allows you to update a resource without having to recompile the uber-jar and restart the topology.

The wrapper is configured in the usual way i.e via the parsefilter.json file, like so

{
  "class": "com.digitalpebble.stormcrawler.elasticsearch.parse.filter.JSONResourceWrapper",
  "name": "ESCollectionTagger",
  "params": {
    "refresh": "60",
    "delegate": {
      "class": "com.digitalpebble.stormcrawler.parse.filter.CollectionTagger",
      "params": {
        "file": "collections.json"
      }
    }
  }
}

The JSONResourceWrapper also needs to know where Elasticsearch lives. This is set via the usual configuration file:

es.config.addresses: "localhost"
es.config.index.name: "config"
es.config.doc.type: "config"
es.config.settings:
  cluster.name: "elasticsearch"

You can then push a modified version of the resources to Elasticsearch e.g. with CURL

curl -XPUT 'localhost:9200/config/config/collections.json?pretty' -H 'Content-Type: application/json' -d @collections.json
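
The periodic reloading described above can be sketched as a small cache which calls a loader again once its content is older than the refresh interval. This is only an illustration of the principle and not the JSONResourceWrapper code: the class RefreshingResource is hypothetical, the loader stands in for whatever fetches the document from Elasticsearch, and the current time is passed in explicitly to keep the example deterministic.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a periodically refreshed resource;
// this is NOT the JSONResourceWrapper implementation.
class RefreshingResource<T> {
    private final Supplier<T> loader;  // e.g. fetches and parses a JSON document
    private final long refreshMillis;  // how long a loaded value stays valid
    private T current;
    private long loadedAt;
    private boolean loaded = false;

    RefreshingResource(Supplier<T> loader, long refreshSeconds) {
        this.loader = loader;
        this.refreshMillis = refreshSeconds * 1000L;
    }

    // Returns the cached value, reloading it first if it has never been
    // loaded or if it is at least as old as the refresh interval.
    synchronized T get(long nowMillis) {
        if (!loaded || nowMillis - loadedAt >= refreshMillis) {
            current = loader.get();
            loadedAt = nowMillis;
            loaded = true;
        }
        return current;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        RefreshingResource<String> rules =
                new RefreshingResource<>(() -> "version " + (++calls[0]), 60);
        System.out.println(rules.get(0L));      // first call triggers a load
        System.out.println(rules.get(30_000L)); // still within 60 seconds
        System.out.println(rules.get(60_000L)); // interval elapsed, reloaded
    }
}
```

A filter delegating to such a resource picks up a modified rule set on the next refresh, without the topology being restarted.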

Another resource we introduced in this release is the FastURLFilter, which also implements JSONResource (but cannot be loaded from ES, as there isn't a wrapper for URLFilters yet). It is similar to the existing URL filter in that it removes URLs based on regular expressions. However, it organises the rules per domain or hostname, which makes it more efficient: a URL doesn't have to be checked against all the patterns, just the ones for its domain. There is even a scope based on metadata key/values, for instance if some of your seeds were organised by collection, as well as a global scope which is tried for all URLs if nothing else matched.

The resource file looks like

[
  {
    "scope": "GLOBAL",
    "patterns": [
      "DenyPathQuery \\.jpg"
    ]
  },
  {
    "scope": "domain:stormcrawler.net",
    "patterns": [
      "AllowPath /digitalpebble/",
      "DenyPath .+"
    ]
  },
  {
    "scope": "metadata:key=value",
    "patterns": [
      "DenyPath .+"
    ]
  }
]

where the Query suffix indicates whether the pattern should be matched against the path + query element or just the path.
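
The benefit of scoping the rules can be illustrated with a minimal sketch. It is not the FastURLFilter code: the class ScopedUrlFilter is hypothetical and makes several simplifying assumptions, namely that the hostname is used as-is as the domain (a real implementation would derive the registered domain), that the first matching rule of a scope decides, that a pattern only needs to be found somewhere in the path, and that the metadata scope is left out.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of scoped URL filtering;
// this is NOT the actual FastURLFilter implementation.
class ScopedUrlFilter {

    // One rule, e.g. "DenyPathQuery \\.jpg" or "AllowPath /digitalpebble/".
    static class Rule {
        final boolean allow;
        final boolean withQuery;
        final Pattern pattern;

        Rule(String line) {
            String[] parts = line.split(" ", 2);
            allow = parts[0].startsWith("Allow");
            withQuery = parts[0].endsWith("Query");
            pattern = Pattern.compile(parts[1]);
        }

        // Matches against the path, or the path + query for *Query rules.
        boolean matches(URI uri) {
            String target = uri.getRawPath() == null ? "" : uri.getRawPath();
            if (withQuery && uri.getRawQuery() != null) {
                target += "?" + uri.getRawQuery();
            }
            return pattern.matcher(target).find();
        }
    }

    private final Map<String, List<Rule>> scopes = new HashMap<>();

    void addRule(String scope, String line) {
        scopes.computeIfAbsent(scope, k -> new ArrayList<>()).add(new Rule(line));
    }

    // Returns TRUE or FALSE if a rule of the scope matched, null otherwise.
    private Boolean check(String scope, URI uri) {
        List<Rule> rules = scopes.getOrDefault(scope, Collections.<Rule>emptyList());
        for (Rule r : rules) {
            if (r.matches(uri)) return r.allow;
        }
        return null;
    }

    // Only the rules of the URL's own domain are tried, then the GLOBAL
    // ones; the URL is kept if no rule matched at all.
    boolean keep(String url) {
        URI uri = URI.create(url);
        Boolean verdict = check("domain:" + uri.getHost(), uri);
        if (verdict == null) verdict = check("GLOBAL", uri);
        return verdict == null || verdict;
    }
}
```

Because the rules are looked up by scope in a map, the cost of filtering a URL depends on the number of rules for its own domain and for the global scope, not on the total number of rules in the file.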

I hope you like this new release of StormCrawler and the new features it brings. I would like to thank all the users and contributors and particularly the Government of Northwest Territories in Canada who kindly donated some of the code of the CollectionTagger.

Happy Crawling!
