DigitalPebble's Blog

Thursday, 16 January 2020

What's new in StormCrawler 1.16?

Happy new year!

StormCrawler 1.16 was released a couple of days ago. You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/26?closed=1

As usual, we recommend that all users upgrade to this version as it contains important fixes and performance improvements.

Dependency upgrades

Tika 1.23 (#771)
ES 7.5.0 (#770)
jackson-databind from 2.9.9.2 to 2.9.10.1 (#767)

Core

OKHttp configure authentication for proxies (#751)
Make URLBuffer configurable + AbstractURLBuffer uses URLPartitioner (#754)
/bugfix/ okhttp protocol: reliably mark trimmed content because of content limit (#757)
/!breaking!/ urlbuffer code in a separate package + 2 new implementations (#764)
Crawl-delay handling: allow `fetcher.max.crawl.delay` to exceed 300 sec. (#768)
okhttp protocol: HTTP request header lacks protocol name and version (#775)
Locking mechanism for Metadata objects (#781)

LangID

/bugfix/ langID parse filter gets stuck (#758)

Elasticsearch

/bugfix/ Fix NullPointerException in JSONResourceWrappers (#760)
ES specify field used for grouping the URLs explicitly in mapping (#761)
Use search after for pagination in HybridSpout (#762)
Filter queries in ES can be defined as lists (#765)
es.status.bucket.sort.field can take a list of values (#766)
Archetype for SC+Elasticsearch (#773)
ES merge seed injection into crawl topology (#778)
Kibana - change format of templates to ndjson (#780)
/bugfix/ HybridSpout get key for results when prefixed by "metadata." (#782)
AggregationSpout to store sortValues for the last result of each bucket (#783)
Import Kibana dashboards using the API (#785)
Include Kibana script and resources in ES archetype (#786)

One of the main improvements in 1.16 is the addition of a Maven archetype to generate a crawl topology using Elasticsearch as a backend (#773). This is done by calling

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST

The generated project also contains a script and resources to load templates into Kibana.

The topology for Elasticsearch now includes the injection of seeds from a file, which was previously in a separate topology. These changes should help beginners get started with StormCrawler.

The previous release included URLBuffers, with just one simple implementation. Two new implementations have been added in #764. The brand new PriorityURLBuffer sorts the buckets by the number of acks they got since the last sort whereas the SchedulingURLBuffer tries to guess when a queue should release a URL based on how long it took its previous URLs to be acked on average. The former has been used extensively with the HybridSpout but the latter is still experimental.
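To make the idea behind the PriorityURLBuffer more concrete, here is a minimal sketch in Python. It is not the StormCrawler implementation (which is written in Java): the class and method names are hypothetical, and serving the queues with the most acks first is an assumption about the sort direction.

```python
from collections import defaultdict, deque


class AckPriorityBuffer:
    """Toy URL buffer: one queue per key, queues ordered by recent ack counts."""

    def __init__(self):
        self.queues = defaultdict(deque)   # key (e.g. hostname) -> pending URLs
        self.acks = defaultdict(int)       # key -> acks received since the last sort
        self.order = []                    # keys in the order they are served

    def add(self, key, url):
        if key not in self.queues:
            self.order.append(key)
        self.queues[key].append(url)

    def acked(self, key):
        self.acks[key] += 1

    def resort(self):
        # Assumption: queues with more acks since the last sort are served first.
        self.order.sort(key=lambda k: self.acks[k], reverse=True)
        self.acks.clear()                  # start counting again for the next sort

    def next_url(self):
        # Return a URL from the first non-empty queue in the current order.
        for key in self.order:
            if self.queues[key]:
                return self.queues[key].popleft()
        return None
```

A real buffer would also rotate between queues and cap their size; the sketch only shows how ack counts can drive the ordering.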

Finally, we added a soft locking mechanism to Metadata (#781) to help trace the source of ConcurrentModificationExceptions. If you are experiencing such exceptions, calling metadata.lock() when emitting e.g.

collector.emit(StatusStreamName, tuple, new Values(url, metadata.lock(), Status.FETCHED))

will trigger an exception whenever the metadata object is modified somewhere else. You might need to call unlock() in the subsequent bolts.

This does not change the way the Metadata works but is just there to help you debug.
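The principle of such a soft lock can be illustrated with a few lines of Python. This is only a conceptual sketch: the class name, the method names other than lock() and unlock(), and the exception type are assumptions, not the actual Java API of the Metadata class.

```python
class LockableMetadata:
    """Toy key/value metadata with a soft lock that is only meant for debugging."""

    def __init__(self):
        self._values = {}
        self._locked = False

    def lock(self):
        self._locked = True
        return self          # returning self allows emit(..., metadata.lock(), ...)

    def unlock(self):
        self._locked = False
        return self

    def set_value(self, key, value):
        # Any write while locked fails loudly, pointing at the offending caller.
        if self._locked:
            raise RuntimeError("attempt to modify a locked metadata object")
        self._values[key] = value

    def get_value(self, key):
        return self._values.get(key)
```

The stack trace of the exception then shows which component tried to modify the object after it had been emitted.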

Hopefully, we should be able to release 2.0 in the next few months. In the meantime, happy crawling and a massive thank you to all contributors!

Thursday, 19 September 2019

What's new in StormCrawler 1.15?

StormCrawler 1.15 was released yesterday and as usual, contains loads of improvements and bugfixes.

You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/25?closed=1

We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

Storm 1.2.3 (#743)
JSOUP 1.12.1 (#741)
ES 7.3.0 (#742)
Tika 1.22 (#726)

Core

/bugfix/ CharsetIdentification crashes on binary content (#747)
FetcherBolt skips tuples which have spent too much time in queues (#746)
Fetcher bolts generate metrics for HTTP status (#745)
improvements to URLFilterBolt (#740)
/bugfix/ FetcherBolt doesn't recover when entering maxNumberURLsInQueues (#738)
/bugfix/ RemoteDriverProtocol does not set user agent correctly (#735)
Force English Locale for SimpleDateFormat in cookie converter (#732)

LangID

LangId normalises and returns value found via extraction (#733)

Elasticsearch

Pluggable URLBuffer and Hybrid Elasticsearch spout (#752)

ES spouts control how long the search is allowed to take with timeout (#753)

Improve types used for numeric values for metrics mappings (#744)

Use sniffer for ES connections (#734)

ScrollSpout to quit logging when finished (#727)

ES spouts use nextFetchDate RangeQuery as a filter (#725)

MetricsConsumer takes an optional date format (#724)

StatusMetricsBolt returns a max of 10K results per status (#723)

Happy crawling and thanks to all contributors!

Friday, 24 May 2019

Reindexing StormCrawler's URL Status Index

StormCrawler holds the frontier and the page fetch status in the "status" index. In the most common setup this is a sharded Elasticsearch index in which every document contains the page URL, the fetch status, the time when the URL should be fetched, and further information as metadata. The index uses the SHA-256 of the URL as the document id. How URLs are distributed among shards is directly related to crawler politeness: a crawler needs to limit the load it puts on any particular server. If all URLs of the same host or domain are kept in the same shard, the spouts (we run one spout per shard) can already ensure that only a limited number of URLs per host or domain is emitted into the topology during a certain time interval. A controlled flow of URLs supports the fetcher bolt, which ultimately enforces politeness and guarantees a minimum delay between successive requests sent to the same server.
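As a rough illustration of the two keys involved, the following Python sketch derives a document id and a routing key from a URL. The function names are hypothetical, and the UTF-8 encoding, the hexadecimal form of the digest and the use of the host name (rather than the domain) as routing key are assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def status_doc_id(url):
    """Hex SHA-256 digest of the URL, used here as the document id."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def routing_key(url):
    """Host name of the URL; documents sharing it end up in the same shard."""
    return urlsplit(url).hostname


for url in ("http://www.example.com/a", "http://www.example.com/b"):
    print(routing_key(url), status_doc_id(url)[:12])
```

Both URLs get different ids but the same routing key, which is what lets a single spout control the flow of URLs for that host.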

Motivation

The status index grows over time. Sooner or later, unless you decide to delete it and start the crawler from scratch, you will run into one of the possible reasons to reindex it:

change the number of shards (to allow the index to grow)
fix domain names acting as sharding key (see the discussion in #684)
strip metadata to save storage space
apply current URL filters and normalization rules to all URLs in the status index
remove the document type – previously the documents in the "status" index were of type "status". In Elasticsearch 7.0 document types are deprecated and consequently SC 1.14 also removed the document types in all indexes (status, metrics, content)
merge/append indexes
pull or push to another ES cluster
... (there are some more good reasons, for sure)

In my case it was a combination of points 1, 2, 3 and 5 applied to a status index holding 250 million URLs of Common Crawl's news crawler. At the same time, the server was being upgraded and I had to move the index to a new machine (point 7). Enough reasons to try the reindexing topology, which was introduced in StormCrawler 1.14 (#688). Of course, some of these points (but not all) could also be achieved with standard Elasticsearch tools.

Configure the Reindex Topology

There are multiple scenarios possible: both Elasticsearch indexes are on the same machine or cluster (using index aliases), or you pull or push from/to a remote index. My decision was to run the reindex topology on the machine hosting the new index and pull the data from the old index. To easily get around the Elasticsearch authentication, I've simply forwarded the remote ES port to the local machine running ssh -R 9201:localhost:9200 <ip_new> -fN & on the old machine. The old index is now visible on port 9201 on the new machine.

In any case, you first need to upgrade Elasticsearch on the target and source machine/cluster to version 7.0, as required by StormCrawler 1.14. In other words, both Elasticsearch indexes must be compatible with the Elasticsearch client used by the StormCrawler version running the reindexing topology.

The topology is easily configured as Storm Flux:

    name: "reindexer"
    
    includes:
      - resource: true
        file: "/crawler-default.yaml"
        override: false
    
      - resource: false
        file: "crawler-conf.yaml"
        override: false   # let the properties defined in the flux take precedence
    
      - resource: false
        file: "es-conf.yaml"
        override: false
    
    config:
      # topology settings
      topology.max.spout.pending: 600
      topology.workers: 1
      # old index (remote, forwarded to port 9201)
      es.status.addresses: "http://localhost:9201"
      es.status.index.name: "status"
      es.status.routing.fieldname: "metadata.hostname"
      es.status.concurrentRequests: 1
      # new index (local)
      es.status2.addresses: "localhost"
      es.status2.index.name: "status"
      es.status2.routing: true
      es.status2.routing.fieldname: "metadata.hostname"
      es.status2.bulkActions: 500
      es.status2.flushInterval: "1s"
      es.status2.concurrentRequests: 1
      es.status2.settings:
        cluster.name: "elasticsearch"
    
    spouts:
      - id: "spout"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.ScrollSpout"
        parallelism: 10   # must be equal to number of shards in the old index
    
    bolts:
    # optional URL filter: it is a bolt, so it has to be declared here
    #  - id: "filter"
    #    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    #    parallelism: 10
      - id: "status"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
        parallelism: 4
        constructorArgs:
          - "status2"
    
    streams:
      - from: "spout"
    #   to: "filter"
    #   grouping:
    #     type: FIELDS
    #     args: ["url"]
    #     streamId: "status"
    # - from: "filter"
        to: "status"
        grouping:
          streamId: "status"
          type: CUSTOM
          customClass:
            className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
            constructorArgs:
              - "byDomain"

The parts needed to plug in a URL filter bolt are commented out. Of course, you should review all settings carefully so that they fit your situation. One important point is the number of spouts, which must be equal to the number of shards in the old index.

After a few failed attempts I settled on defensive performance settings:

a not too high number of tuples pending in the topology: topology.max.spout.pending: 600. With 10 spouts there are max. 6000 pending tuples allowed.
four status bolts without concurrent requests (es.status2.concurrentRequests: 1).

To speed up the reindexing you might also change the refresh interval of the new Elasticsearch index by calling:

    curl -H Content-Type:application/json -XPUT 'http://localhost:9200/status/_settings' \
        --data '{"index" : {"refresh_interval" : -1 }}'

Don't forget to set it back to the default once the reindexing is done:

    curl ... --data '{"index" : {"refresh_interval" : null }}'

Running the Topology

The reindex topology is started as Flux via:

    java -cp crawler-1.14.jar org.apache.storm.flux.Flux --remote reindex-flux.yaml

Depending on the size of your index it may run for quite a while. I achieved about 6,000 reindexed documents per second, which means that all 250 million documents were reindexed in about 12 hours.
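The estimate can be checked with a line of arithmetic, shown here as a small Python sketch. The function name is hypothetical and a constant indexing rate is assumed, which a real run will only approximate.

```python
def reindex_hours(total_docs, docs_per_second):
    """Estimated wall-clock time in hours at a constant indexing rate."""
    return total_docs / docs_per_second / 3600


hours = reindex_hours(250_000_000, 6_000)
print(f"{hours:.1f} hours")   # prints "11.6 hours"
```

Plugging in the size of your own index and the rate you observe in the Storm UI gives a quick idea of how long to expect.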

Verifying the New Status Index

The topology has finished, so let's check the new index and whether everything has been reindexed. Let's first get the metrics of the old status index (on port 9201):

%> curl -H Content-Type:application/json -XGET 'http://localhost:9201/_cat/indices/status?v'
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   status xFN4EGHYRaGw8hi4qxMWrg  10   1  250305363      6038664      101gb          101gb

and compare it with those of the new status index:

%> curl -H Content-Type:application/json -XGET 'http://localhost:9200/_cat/indices/status?v'
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   status ZFNT8FovT4eTl3140iOUvw  16   1  250278257         5938     87.2gb         87.2gb

Great! The new index now has 16 shards and about 14 GB of storage have been saved by stripping unneeded metadata. All 250 million documents have been reindexed. But stop: it's not all documents, a few thousand are missing! Panic! What's going on? I checked the logs for errors: nothing. Also: why are there deleted documents in the new index?

Ok, let's first calculate the loss: 250305363 - 250278257 = 27106 (0.01%). Well, it's probably not worth redoing the procedure: either the links are outdated or the crawler will find them again. Anyway, I was interested in figuring out the reason. But how to find the missing URLs? It's not trivial to compare two lists of 250 million items.
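For the record, the same calculation as a short Python sketch, using the docs.count values reported by the two _cat/indices calls (the variable names are mine):

```python
old_count = 250_305_363   # docs.count of the old index
new_count = 250_278_257   # docs.count of the new index

missing = old_count - new_count
share = missing / old_count * 100   # percentage of the old index

print(missing)              # prints 27106
print(f"{share:.3f}%")      # prints 0.011%
```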

The solution is to first get the number of items per domain by aggregating counts on the field "metadata.hostname", which is used for routing documents to shards. The idea is to find the domains where the counts differ and then compare only the per-domain lists. Let's do it:

curl -s -H Content-Type:application/json -XGET http://localhost:9200/status/_search --data '{
  "aggs" : {
    "agg" : {
      "terms" : {
        "field" : "metadata.hostname",
        "size" : 100000
      }
    }
  }
}' | jq --raw-output '.aggregations.agg.buckets[] | [(.doc_count|tostring),.key] | join("\t")' \
   | rev | sort | rev >domain_counts_new_index.txt

A short explanation of what this command does:

we send an aggregation query to the Elasticsearch index which counts documents per domain name (kept in "metadata.hostname")
the JSON output is processed by jq and the output is written into a text file with two columns – count and domain name
the list is sorted using reversed strings – this keeps the counts together per top-level domain, domain, subdomain which makes the comparison easier
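The effect of the rev | sort | rev step can be reproduced with a small Python sketch. The function name and the sample rows are only for illustration. Note that Python compares code points, as sort does under LC_ALL=C, so the exact order may differ from a locale-aware sort.

```python
def sort_by_reversed_line(lines):
    """Order lines by comparing their reversed strings, like `rev | sort | rev`."""
    return sorted(lines, key=lambda line: line[::-1])


rows = [
    "11\twww.hr.de",
    "4\tathletics.africa",
    "6\thr.de",
    "76\twww.athletics.africa",
]
for row in sort_by_reversed_line(rows):
    print(row)
```

Because the comparison starts at the end of each line, entries are grouped by top-level domain first, then by domain, then by subdomain, so athletics.africa ends up directly above www.athletics.africa.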

The procedure to get the per-domain counts from the old index is the same. After we have both lists we can compare them side by side:

%> diff -y domain_counts_old_index.txt domain_counts_new_index.txt
...
4       athletics.africa              | 80      athletics.africa
76      www.athletics.africa          <
108     chri.ca                         108     chri.ca
2       www.alta-frequenza.corsica    | 2       alta-frequenza.corsica
2       corsenetinfos.corsica         | 3       corsenetinfos.corsica
1       www.corsenetinfos.corsica     <
...

Already the first difference reveals the reason for the discrepancies between the old and new index! You remember, there was this issue with domain names changing over time with different versions of the public suffix list? It was one of the reasons why the reindexing topology was introduced (see #684 and news-crawler#28). If the routing key is not stable, the same URL may be indexed twice with different routing keys in two shards. Exactly this happened for some of the domains affected by this issue. Here is one example:

6       hr.de                |  19      hr.de
2       reportage.hr.de      <
1       daserste.hr.de       <
11      www.hr.de            <

In the new index there are only 19 items although the sum of 6 + 2 + 1 + 11 is 20. I've checked the remaining differences and the assumption has proven true: the only domain names affected are those with recently introduced TLDs (.africa was registered by ICANN in 2017) or misclassified suffixes (co.uk is a valid suffix but hr.de is not).
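This kind of check is easy to script. The Python sketch below sums the old per-host counts under the routing keys found in the new index and prints the difference. The function name is hypothetical, and matching hosts to domains by suffix is a simplification of what the public suffix list does.

```python
from collections import Counter


def regroup(host_counts, domains):
    """Sum per-host counts under the longest matching domain from `domains`."""
    totals = Counter()
    for host, count in host_counts.items():
        matches = [d for d in domains if host == d or host.endswith("." + d)]
        key = max(matches, key=len) if matches else host
        totals[key] += count
    return totals


old = {"hr.de": 6, "reportage.hr.de": 2, "daserste.hr.de": 1, "www.hr.de": 11}
new = {"hr.de": 19}

expected = regroup(old, new.keys())
for domain, count in new.items():
    # domain, summed count in the old index, count in the new index, surplus
    print(domain, expected[domain], count, expected[domain] - count)   # hr.de 20 19 1
```

A surplus of 1 means that one URL was stored twice in the old index under different routing keys and has now been merged into a single document.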

Everything is fine now and the only open point is to bring the new index into production and restart the crawl topology on the new machine!

Monday, 13 May 2019

What's new in StormCrawler 1.14

StormCrawler 1.14 was released yesterday and as usual, contains loads of improvements and bugfixes.

You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/24?closed=1

This release contains a number of breaking changes, mostly related to the move to Elasticsearch 7. We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

crawler-commons 1.0 (#693)
okhttp 3.14.0 (#692)
guava 27.1 (#702)
icu4j 64.1 (#702)
httpclient 4.5.8 (#702)
Snakeyaml 1.24 (#702)
wiremock 2.22.0 (#702)
rometools 1.12.0 (#702)
Elasticsearch 7.0.0 (#708)

Core

Track how long a spout has been without any URLs in its buffer (#685)
Change ack mechanism for StatusUpdaterBolts (#689)
Robots URL filter to get instructions from cache only (#700)
Allow indexing under canonical URL if in the same domain, not just host (#703)
/bugfix/ URLs ending with a space are fetched over and over again (#704)
ParseFilter to normalise the mime-type of documents into simple values (#707)
Robot rules should check the cache in case of a redirection (#709)
/bugfix/ Fix the logic around sitemap = false (#710)
Reduce logging of exceptions in FetcherBolt (#719)

Elasticsearch

Asynchronous spouts (i.e. ES) can send queries after max delay since previous one ended (#683)
StatusUpdaterBolt to load config from non-default param names (#687)
Add a ScrollSpout to read all the documents from a shard (#688 and #690) - see in our guest post how this can be used to reindex a status index.
ES IndexerBolt : check success of batches before acking tuples (#647)
/bugfix/ URLs with content that breaks ES get refetched over and over again (#705)
/bugfix/ URLs without valid host name (and routing) stay DISCOVERED forever (#706)
/bugfix/ ESSeedInjector: no URLs injected because URL filter does not subscribe to status stream (#715)
MetricsConsumer to include topology ID in metrics (#714)

WARC

Generate WARC request records (#509)
WARC format improvements (#691)

Tika

Set mimetype whitelist for Tika Parser (#712)

*********

I will be running a workshop on StormCrawler next month at the Web Archiving Conference in Zagreb and giving a presentation jointly with Sebastian Nagel of Common Crawl. I will come with loads of presents generously given by our friends at Elastic.

As usual, thanks to all contributors and users.

Happy crawling!

Monday, 11 February 2019

Meet StormCrawler users: Q&A with Pixray (Germany)

We are opening a series of Q&A blogs with Maik Piel telling us about the use of StormCrawler at Pixray.

Q: What do you guys do at Pixray? Why do you need web crawling?

We are experts in image tracking on the web. We work for image rights holders to protect their pictures on the web as well as for brands and manufacturers to monitor sales channels. Our customers range from news agencies and picture agencies, individual photographers and e-commerce companies to luxury brands. Web crawling is one of the core building blocks of our platform - next to a massive picture matching platform, various APIs and our customer portals.

Q: What sort of crawls do you do? How big are they?

We do three kinds of scans: broad scans across complete regions of the web (like the EU or North America), deep scans on single domains and also near-realtime discovery scans on thousands of selected domains. For all of these different scans, we employ customized versions of StormCrawler to match the very distinct requirements in crawling patterns. Obviously, the biggest crawls are the broad regional scans, including more than 10 billion URLs and tens of millions of different domains.

Q: What software stack do you use? e.g. SC + ES + Grafana? Hardware used?

Adapted and extended versions of StormCrawler as well as Elasticsearch and Kibana. We couple our crawling infrastructure with the rest of our platform through RabbitMQ. Our crawler runs on Ubuntu servers, each with an Intel Core i7, 32 GB of RAM and 4 TB of disk space. Each runs Apache Storm and Elasticsearch. In the future, we will split the storage (Elasticsearch) and the computation (Storm) layers onto separate hardware. We are also looking at options to employ container and service orchestration frameworks to scale our crawler infrastructure dynamically.

Q: Why did you choose StormCrawler?

We initially built our crawler on Apache Nutch. Needless to say, Nutch is a great and robust platform. But once you grow beyond a certain point you start to see limitations. The biggest limitations are the low responsiveness to changes and the uneven system utilization due to the long generate/crawl/update cycles. It sometimes took us 24 hours or more until we could see the effects of a change we made to the software. Furthermore, we found that it is a bit troublesome to get valid statistics data from Nutch in real time. StormCrawler solves all that for us. Every config or code change that we commit shows its effect immediately and you get statistics very, very easily. There is no long-cycle batching anymore in StormCrawler, which gives us very even and continuous crawling, reducing our need for massive queuing of results to ensure an even utilization of downstream infrastructure. Kibana gives us great real-time insights into the crawl database. With Nutch, we had to run analysis jobs of around 4 hours, even if we just needed the status of a single URL.

Q: What do you like the most / least in StormCrawler?

Besides the points mentioned above, we have to praise StormCrawler's extensibility. In our different setups we have both made changes to existing code in the StormCrawler project and written large amounts of our own code. The structure Apache Storm imposes is great. Components are very cleanly decoupled and it is easy to introduce custom functionality by just writing new Spouts and Bolts and linking them into the topology. For our use case we, of course, had to deal with pictures - which StormCrawler itself does not do. We just created our own Bolts for that. For our near-realtime discovery crawler, we needed an engine that calculates the revisit date for a URL based on various factors instead of a static value; again, we could just create a specific spout for that.

Q: Anything in particular you'd like to have in a future release?

It would be great to have a built-in way to prioritize different TLDs within the StormCrawler spouts. We have built a custom solution for that which we might contribute back to StormCrawler at some point.

Sunday, 6 January 2019

What's new in StormCrawler 1.13

Happy new year!

I have just released StormCrawler 1.13, which contains important bug fixes and some nice improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades

Tika 1.20 (#676)

Xerces 2.12.0 (#672)

Guava 27.0.1 (#672)

Elasticsearch 6.5.3 (#672)

Jackson 2.8.11.3 (14e44)

Core

FileSpout uses StringTabScheme by default (#664)

JSoupParserBolt outlink limit per page (#670)

/BUGFIX/ Date format used for HTTP if-modified-since requests must follow RFC7231 (#674)

/BUGFIX/ DeletionBolt expects Metadata from tuples (#675)

Added configurable TextExtractor to JSoupParserBolt (#678)

!BREAKING! Core Spouts should use status stream if withDiscoveredStatus is set to true (#677)

SQL

SQL IndexerBolt (#608)

Archetype

Archetype sets StormCrawler version in a property (#668)

Replace ContentFilter with TextExtractor (#678)

Apart from the changes to the core spouts (#664 and #677), the main new feature is the addition of the TextExtractor (#678) for the JSoupParserBolt. Unlike the ContentParseFilter, which it replaces, it is configured from the main configuration and is not a ParseFilter, as it operates directly on the objects generated by Jsoup. The TextExtractor allows restricting the text to specific elements to avoid boilerplate and navigation elements, and it also provides far cleaner text content than the ContentParseFilter, which merges some tokens. The TextExtractor can also be used to define exclusion zones, which are applied either to the restricted zones or to the whole document if no such zones were defined or found. This is useful, for instance, to remove SCRIPT or STYLE elements.
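The interplay of inclusion and exclusion zones can be sketched in a few lines of Python using the standard html.parser module. This is only an illustration of the principle, not the TextExtractor itself (which is written in Java and works on Jsoup objects). All names are hypothetical, zones are matched by tag name only, and the fallback is triggered when no text was collected inside an inclusion zone.

```python
from html.parser import HTMLParser


class ZoneTextExtractor(HTMLParser):
    """Collects text inside include tags, skipping anything inside exclude tags."""

    def __init__(self, include_tags, exclude_tags):
        super().__init__()
        self.include_tags = {t.lower() for t in include_tags}
        self.exclude_tags = {t.lower() for t in exclude_tags}
        self.include_depth = 0   # > 0 while inside an included element
        self.exclude_depth = 0   # > 0 while inside an excluded element
        self.zone_text = []      # text found inside inclusion zones
        self.all_text = []       # text of the whole document, used as fallback

    def handle_starttag(self, tag, attrs):
        if tag in self.exclude_tags:
            self.exclude_depth += 1
        elif tag in self.include_tags:
            self.include_depth += 1

    def handle_endtag(self, tag):
        if tag in self.exclude_tags and self.exclude_depth:
            self.exclude_depth -= 1
        elif tag in self.include_tags and self.include_depth:
            self.include_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.exclude_depth:
            return               # whitespace only, or inside an exclusion zone
        self.all_text.append(text)
        if self.include_depth:
            self.zone_text.append(text)


def extract_text(html, include_tags, exclude_tags):
    parser = ZoneTextExtractor(include_tags, exclude_tags)
    parser.feed(html)
    parser.close()
    # Fall back to the whole document if the inclusion zones yielded no text.
    return " ".join(parser.zone_text or parser.all_text)


page = "<body><nav>Menu</nav><article><p>Hello</p><script>var x = 1;</script><p>world</p></article></body>"
print(extract_text(page, ["ARTICLE"], ["SCRIPT", "STYLE"]))
```

Here the navigation text is dropped because it lies outside the ARTICLE element, and the SCRIPT content is dropped although it lies inside it.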

As usual, thanks to all contributors and users, and particularly the Government of Northwest Territories in Canada who kindly donated some of the code of the TextExtractor.

Happy crawling!

Thursday, 22 November 2018

What's new in StormCrawler 1.12

The previous release was only last month but I decided to ship this one now as it contains several bugfixes and improvements which many users would benefit from.

As you can see below, the main changes are around protocols and sitemaps. We have used Selenium and OKHTTP a lot recently to deal with dynamic websites and the changes below definitely help for these. There is also an important bugfix for JSOUP (#653) and various other improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades

JSOUP 1.11.3 (#663)
Elasticsearch 6.5.0 (#661)
Jackson and Wiremock dependencies (#640)

Core

Post JSON data with OKHTTP protocol via metadata (#641)
Selenium RemoteDriverProtocol triggered by K/V in metadata (#642)
SeleniumProtocol NavigationFilters not reached in case of a redirection (#643)
Limit crawl to URLs found in sitemaps (#645)
spout.reset.fetchdate.after based on time when query was set to NOW (#648)
Avoid StackOverflowError when generating DocumentFragment from JSOUP (#653)
redirected sitemaps don't have isSitemap=true (#660)
Staggered scheduling of sitemap URLs (#657)
Scheduling -> round to the closest second, minute or hour (#654)
FetcherBolt don't add discovered sitemaps if the robots rules do not allow them (#662)

WARC

WARC record format: trailing zero byte causes WARC parser to fail (#652)

Elasticsearch

ES IndexerBolt track number of batch sent (#540)
Rename index index into docs (#649)
ES StatusMetricsBolt generate metrics for total number of docs (#651)

Coming next...

The release of Storm 2.0.0 has taken longer than expected, which is partly my fault as I reported a number of issues. These issues have now been fixed and hopefully, 2.0.0 will be out soon. As mentioned last month, there's a branch of StormCrawler which works on the Storm 2.x branch. Give it a try if you want to be on the cutting edge!

Finally, there will be a StormCrawler workshop in Vilnius next week. I am sure tickets are still available if you fancy a last minute trip to Lithuania.

As usual, thanks to all contributors and users. Happy crawling!

UPDATE

There were 2 bugs in release 1.12 which have been fixed in 1.12.1, see details on

https://github.com/DigitalPebble/storm-crawler/milestone/23?closed=1