
Thursday 23 November 2023

Meet the StormCrawler users: Q&A with the OpenWebSearch.eu project

It has been a while since our first “Meet the StormCrawler users” blog post and, since StormCrawler is still going strong and used by a wide variety of users, we are delighted to put the spotlight on one of the most exciting projects that use it. Our guests today are Michael Dinzinger and Saber Zerhoudi, both from the University of Passau in Germany.


Can you please introduce yourselves and the project you are working on?

Hello, we are Saber and Michael, both PhD students in Passau. Since September 2022, we have been working on OpenWebSearch.eu, a European research project, in which people from now more than 15 participating institutes collaborate on building an Open Web Index.

Our task here at Uni Passau is collaborative and resource-efficient crawling, which is the first technical step in building the Index (see figure below). The end results are metadata and index files, currently in Parquet and CIFF format. These are hosted on the project partners’ shared infrastructure and will soon be available for download.

By providing these files to our users, we want to empower them to work on new search applications and tap the web as a resource for their research and business ideas. The Open Web Index is in this sense a truly open, transparent and legally compliant alternative to the proprietary Web Indices of the big tech gatekeepers.

How do you use StormCrawler and URLFrontier?

We use StormCrawler to build our own crawling pipelines by configuring, and in some cases extending, the existing software components. We particularly appreciate its high customizability: we use the framework both for the classic discovery crawling that feeds the Open Web Index and for more task-specific, research-oriented crawling.

A major challenge in our work is the heterogeneous infrastructure on top of which we are building the crawling system. The different infrastructure partners in the project provide a large set of commodity hardware, hosted across different datacenters and dispersed over Europe. Despite the geographic distribution of the machines, all nodes should collaborate on the same shared crawl. For that purpose, we deploy URLFrontier at a central computing site. The Frontier services distribute the crawl space and communicate with the remote crawlers in order to provide them with a continuous flow of URLs to be fetched (see figure below). URLFrontier can use different backends to store the data; we chose one that leverages another open source project, OpenSearch.
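
To give a concrete, if simplified, idea of the crawler side of this setup: a remote StormCrawler instance only needs to know where the central Frontier lives. The snippet below is an illustrative sketch; the key names and default port are indicative only and should be checked against the urlfrontier module of the StormCrawler version in use.

  # illustrative sketch: a remote crawler pointing at the central Frontier
  urlfrontier.host: "frontier.example.org"   # hypothetical address of the central computing site
  urlfrontier.port: 7071                     # assumed default URLFrontier gRPC port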

What results did you get so far?

The crawling is still in its experimental phase, but we have already achieved some promising numbers. For example, we are currently running three StormCrawler instances. Together they have fetched over 200M web pages within a single week, and each of them produces between 200 and 250 GiB of WARC files per day. The crawled data is filtered and enriched with meta information before it is made available to the public as index and metadata files. In the next steps, we want to scale the crawling up to several terabytes and improve the prioritisation of crawl URLs so that we focus strongly on high-quality pages.

It is definitely worth mentioning that the WARC module of StormCrawler helped us a lot. To get our indexing pipeline going, we started by copying WARC files from CommonCrawl before we were able to crawl on our own.

Why did you choose StormCrawler?

We chose StormCrawler primarily for its compatibility with URLFrontier. This synergy made it an excellent starting point for developing a large-scale, coordinated, and distributed crawling cluster. Additionally, the open-source nature of the project and its active community influenced our decision. It was crucial for us to be supported by a network of developers who continuously enhance the core software and provide assistance or solutions when needed.

Did you make any contributions to it? Any advice you could give to future users and contributors?

Yes, we have contributed to StormCrawler by creating a forked version named OWLer.

This version includes several improvements and additions we deemed necessary for our project. We've implemented extended topologies for various purposes and added a classification component to categorise and annotate URLs based on either just the URL or the URL plus website content. It serves as a labelling tool for the crawler's content. 


URLFrontier has also been expanded to accommodate these modifications, enabling crawlers to specialise in topics, languages, genres, etc. 


Moreover, we have introduced a "Crawling-On-Demand" service. Users can register their requests on the new OWler webpage by specifying a list of seed URLs and additional information. Upon submission, a StormCrawler instance is deployed in our infrastructure, fetching and storing the content as WARC files in a dedicated S3 bucket. Once completed, users receive a link to download the WARC files via email. URLFrontier tracks the progress of these crawls.

What's next?

We are currently expanding our "Crawling-On-Demand" service to include "Indexing-On-Demand." Users will be able to specify a list of seed URLs and additional tags. We will then search our database of previously crawled and processed URLs for recent content matching this list and provide it to the user in an indexed format.

LinkedIn: openwebsearch-eu
X: OpenWebSearchEU
Mastodon: @openwebsearcheu@suma-ev.social

Thursday 26 October 2023

Focus on protocol improvements in StormCrawler 2.10

StormCrawler 2.10 was released yesterday and, as usual, it contains loads of improvements, dependency upgrades and bug fixes. Instead of going through each one of them, we will focus specifically on what was done for protocols.

First, every protocol implementation can now easily be tested on the command line, even FileProtocol or DelegatorProtocol thanks to #1097. For instance, 

storm local target/xxx-1.0-SNAPSHOT.jar com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol -f crawler-conf.yaml https://storm.apache.org/ -b

which configures the RemoteDriverProtocol with the content of crawler-conf.yaml and displays info on the console.

You might have noticed that the option to specify a configuration file has changed from -c to -f, as the former conflicted with a Storm operator. We also added an option, -b, which dumps the content of the URL to a file in the temp folder, making it very easy to check what the protocol actually retrieved for a given configuration.

Using the command above in combination with debugging is particularly powerful. This can easily be done with 

export STORM_JAR_JVM_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:8000"

One of the main changes concerns the Selenium module. It had not seen any work for a while and its configuration was pretty obsolete. With #1093, we added some much-needed unit tests and removed some incompatible configuration options. #1100 added an option to deactivate tracing, and we fixed the user agent substitution (#1109).

An important and incompatible change in the Selenium module concerns the way the timeouts are configured. The previous mechanism was opaque and error-prone. It was replaced in #1101; the timeouts are now configured with a map:

  selenium.timeouts:
    script: -1
    pageLoad: -1
    implicit: -1

with -1 preserving the Selenium default values.

The DelegatorProtocol has also been greatly improved. If you are not familiar with it, it allows you to determine which protocol implementation should be used for a URL given the metadata it has. For instance, 

  # use the normal protocol for sitemaps
  protocol.delegator.config:
   - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
     filters:
       isSitemap: "true"
   - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

will use the OKHTTP protocol for a URL if it has a key isSitemap in its metadata with a value of true. Otherwise it will use the Selenium implementation.

With #1098, we added an operator indicating whether the conditions should be treated as an AND or OR. We also added the possibility to triage based on regular expressions on the URL itself (#1110). 

You can now express more complex configurations such as  

  # use the normal protocol for sitemaps, robots and if asked explicitly
  protocol.delegator.config:
   - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
     operator: OR
     filters:
       isSitemap: "true"
       robots.txt:
       skipSelenium:
     regex:
       - \.pdf
       - \.doc
   - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

As a result, we removed the deprecated class DelegatorRemoteDriverProtocol.

The DelegatorProtocol is of course particularly useful for avoiding sending URLs to the Selenium implementation unnecessarily, as illustrated above.


Of course, StormCrawler 2.10 contains other changes and dependency updates and, as usual, we recommend that you upgrade. As we have seen today, the improvements that make protocol implementations easier to test and configure should be reason enough to do so.

We would like to thank all the users and contributors to the 2.10 release.

Happy crawling!


Tuesday 22 March 2022

What's new in StormCrawler 2.3

StormCrawler 2.3 was released yesterday. It contains a relatively small number of changes compared to previous releases, but these include important bug fixes. We have also ported existing ParseFilters to JSoupParseFilters, leading to some noticeable performance improvements and an exuberant tweet.


We also welcomed Richard Zowalla as a new committer on the project.

Here are the main changes.

Dependency upgrades

  • Elasticsearch 7.17.0 
  • Tika 2.3.0
  • Caffeine 2.9.3 

Core

  • Convert LinkParseFilter into a JSoupFilter (#944)
  • Rewrote LinkParseFilter + added XPathFilter + tests for JSOUPFilters (#953)
  • General Code Refactoring and Good Practices (#937)
  • Add unified way of initializing classes via string … (#943)
  • Changed order of emit outlinks and emit of parent url ... (#954)

Elasticsearch 

  • Enable compression (#941)
  • Enable _source for content index in ES archetype (#958)

URLFrontier

  • Spout does not reconnect to URLFrontier if an exception occurs (#956)

The next release will probably include a new module for Elasticsearch 8 (see #945). If you have some experience with the new ES client library, your contribution will be very welcome.

Thank you to all users and contributors, in particular Felix Engl for his work on the code refactoring and Julian Alvarez for reporting and fixing the bug in #954.

Our users Gage Piracy have also been very generous in donating some of the customisations we wrote for them back to the project.

Happy crawling!

Tuesday 11 January 2022

What's new in StormCrawler 2.2

StormCrawler 2.2 has just been released. This marks the beginning of having releases only for 2.x: 1.18 was the last release for the 1.x branch, which is now discontinued. In case you were wondering why there was no "What's new in StormCrawler 2.1", it is simply because 2.1 contained the same modifications as 1.18 and did not get its own announcement.

This version contains many bugfixes; as usual, users are advised to upgrade to this version.

Happy crawling and thanks to our sponsors, contributors and users!

PS: I am tempted to run a workshop on web crawling with StormCrawler at the BigData conference in Vilnius in November. Anyone interested? If so, please get in touch and let me know what you'd like to learn about. https://bigdataconference.eu/

Wednesday 5 May 2021

What's new in StormCrawler 1.18

StormCrawler 1.18 has just been released. Since the previous version dates from nearly 10 months ago, the number of changes is rather large (see below).

This version contains many bugfixes; as usual, users are advised to upgrade to this version. One of the noticeable new features is a module for URLFrontier (if you haven't checked it out, do so right now!); I will publish a tutorial on how to use it soon.

1.18 is also likely to be the last release based on Apache Storm 1.x; our 2.x branch will become master as soon as I have released 2.1.

Happy crawling and thanks to our sponsors, contributors and users!

Monday 20 July 2020

Please welcome StormCrawler 2.0

Nearly 6 years after its initial release and after another 32 releases, StormCrawler has just reached version 2.0! 

This is similar to what we did 4 years ago when 1.0 was released, in that the change of major version reflects the version of Apache Storm that StormCrawler is based on. This is not a major refactoring of StormCrawler in any way, although some minor changes can be found, mainly in the way the topologies are submitted. These changes are documented in the READMEs generated by our archetypes.

In terms of functionalities and behavior, StormCrawler 2.0 is similar to the version 1.17 released a few minutes ago.

I expect to keep both branches in parallel for a bit, at least until StormCrawler 2.0 has been sufficiently tested and is used by the majority of our users.

The change to Apache Storm 2 is not just a way of future-proofing StormCrawler, since version 2 is the current branch of Apache Storm. By adopting Storm 2, we also get a platform that is 100% Java, which makes debugging and potential contributions to Apache Storm itself easier, and we benefit from Storm's recent improvements such as better performance and an improved backpressure model.

I am looking forward to getting feedback (and bugfixes) from the StormCrawler community. Please give StormCrawler 2.0 a try if you can.

Happy crawling! 




What's new in StormCrawler 1.17


I have just released StormCrawler 1.17. As you can see in the list below, it contains important bugfixes and improvements. For this reason, we recommend that all users upgrade to this version; however, please check the breaking changes below if you apply it to an existing crawl.

Dependency upgrades

  • Various dependency upgrades #808
  • CrawlerCommons 1.1 dependency #807
  • Tika 1.24.1 #797
  • Jackson-databind #803 #793 #798

Core

  • Use regular expressions for custom number of threads per queue fetcher #788
  • /!breaking!/ Prefix protocol metadata #789
  • Basic authentication for OKHTTP #792
  • Utility to debug / test parsefilters #794
  • /!breaking!/ Remove deprecated methods and fields enhancement #791
  • AdaptiveScheduler to set last-modified time in metadata  #777 #812
  • /bugfix/ _fetch.exception_ key should be removed from metadata if subsequent fetches are successful #813
  • /bugfix/ SimpleFetcherBolt maxThrottleSleepMSec not deactivated #814
  • /!breaking!/ Index pages with content="noindex,follow" meta tag #750
  • Enable extension parsing for SitemapParser enhancement parser #749 #815

WARC



Elasticsearch


  • /bugfix/ AggregationSpout error due SimpleDateFormat not thread safe #809
  • /bugfix/ IndexerBolt issue causing ack failures #801
  • Allow ES to connect over a proxy #787

Of the breaking changes above, #789 is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add

protocol.md.prefix: ""

to the configuration. Similarly, http.skip.robots has been renamed to http.robots.file.skip.
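
Put together, a minimal configuration snippet for keeping the pre-1.17 behaviour on an existing crawl could look like the following sketch (the robots entry is only needed if you previously had http.skip.robots set to true):

  # keep the un-prefixed protocol metadata keys used by pre-1.17 crawls
  protocol.md.prefix: ""
  # renamed from http.skip.robots
  http.robots.file.skip: true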


Thanks to all contributors and users! Happy crawling! 

PS: something equally exciting is coming next ;-)



Thursday 16 January 2020

What's new in StormCrawler 1.16?

Happy new year!

StormCrawler 1.16 was released a couple of days ago. You can find the full list of changes at https://github.com/DigitalPebble/storm-crawler/milestone/26?closed=1

As usual, we recommend that all users upgrade to this version as it contains important fixes and performance improvements.

Dependency upgrades

  • Tika 1.23 (#771)
  • ES 7.5.0 (#770)
  • jackson-databind from 2.9.9.2 to 2.9.10.1 dependency (#767)

Core

  • OKHttp configure authentication for proxies (#751)
  • Make URLBuffer configurable + AbstractURLBuffer uses URLPartitioner (#754)
  • /bugfix/ okhttp protocol: reliably mark trimmed content because of content limit (#757)
  • /!breaking!/ urlbuffer code in a separate package + 2 new implementations (#764)
  • Crawl-delay handling: allow `fetcher.max.crawl.delay` to exceed 300 sec. (#768)
  • okhttp protocol: HTTP request header lacks protocol name and version (#775)
  • Locking mechanism for Metadata objects (#781)

LangID

  • /bugfix/ langID parse filter gets stuck (#758)

Elasticsearch

  • /bugfix/ Fix NullPointerException in JSONResourceWrappers  (#760)
  • ES specify field used for grouping the URLs explicitly in mapping (#761)
  • Use search after for pagination in HybridSpout (#762)
  • Filter queries in ES can be defined as lists (#765)
  • es.status.bucket.sort.field can take a list of values (#766)
  • Archetype for SC+Elasticsearch (#773)
  • ES merge seed injection into crawl topology (#778)
  • Kibana - change format of templates to ndjson (#780)
  • /bugfix/ HybridSpout get key for results when prefixed by "metadata." (#782)
  • AggregationSpout to store sortValues for the last result of each bucket (#783)
  • Import Kibana dashboards using the API (#785)
  • Include Kibana script and resources in ES archetype (#786)

One of the main improvements in 1.16 is the addition of a Maven archetype to generate a crawl topology using Elasticsearch as a backend (#773). This is done by calling

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST

The generated project also contains a script and resources to load templates into Kibana.

The topology for Elasticsearch now includes the injection of seeds from a file, which was previously in a separate topology. These changes should help beginners get started with StormCrawler.
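
For illustration, the seed injection in the generated topology boils down to a FileSpout reading a seeds file, along the lines of the Flux sketch below; the exact arguments in the archetype's generated files may differ slightly from this.

  spouts:
    - id: "filespout"
      className: "com.digitalpebble.stormcrawler.spout.FileSpout"
      parallelism: 1
      constructorArgs:
        - "."          # directory containing the seed file
        - "seeds.txt"  # file with one URL per line
        - true         # whether to emit the URLs with a DISCOVERED status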

The previous release included URLBuffers, with just one simple implementation. Two new implementations were added in #764. The brand new PriorityURLBuffer sorts the buckets by the number of acks they got since the last sort, whereas the SchedulingURLBuffer tries to guess when a queue should release a URL based on how long its previous URLs took to be acked on average. The former has been used extensively with the HybridSpout, but the latter is still experimental.
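
To experiment with one of the new implementations, the buffer class can be selected via the configuration. The key and package names in the sketch below are given from memory, so double-check them against the 1.16 javadocs before relying on them:

  # switch the buffer implementation used by the spouts (e.g. the HybridSpout)
  urlbuffer.class: "com.digitalpebble.stormcrawler.persistence.urlbuffer.PriorityURLBuffer"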

Finally, we added a soft locking mechanism to Metadata (#781) to help trace the source of ConcurrentModificationExceptions. If you are experiencing such exceptions, calling metadata.lock() when emitting, e.g.

collector.emit(StatusStreamName, tuple, new Values(url, metadata.lock(), Status.FETCHED))

will trigger an exception whenever the metadata object is modified somewhere else. You might need to call unlock() in the subsequent bolts.

This does not change the way Metadata works; it is just there to help you debug.

Hopefully, we should be able to release 2.0 in the next few months. In the meantime, happy crawling and a massive thank you to all contributors!



Thursday 19 September 2019

What's new in StormCrawler 1.15?

StormCrawler 1.15 was released yesterday and as usual, contains loads of improvements and bugfixes.


We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

Core

  • /bugfix/ CharsetIdentification crashes on binary content (#747)
  • FetcherBolt skips tuples which have spent too much time in queues (#746)
  • Fetcher bolts generate metrics for HTTP status (#745)
  • improvements to URLFilterBolt (#740)
  • /bugfix/ FetcherBolt doesn't recover when entering maxNumberURLsInQueues (#738)
  • /bugfix/ RemoteDriverProtocol does not set user agent correctly (#735)
  • Force English Locale for SimpleDateFormat in cookie converter (#732)

LangID

  • LangId normalises and returns value found via extraction (#733)

Elasticsearch

  • Pluggable URLBuffer and Hybrid Elasticsearch spout (#752)
  • ES spouts control how long the search is allowed to take with timeout (#753)
  • Improve types used for numeric values for metrics mappings (#744)
  • Use sniffer for ES connections (#734)
  • ScrollSpout to quit logging when finished (#727)
  • ES spouts use nextFetchDate RangeQuery as a filter (#725)
  • MetricsConsumer takes an optional date format (#724)
  • StatusMetricsBolt returns a max of 10K results per status (#723)

Happy crawling and thanks to all contributors!

Friday 24 May 2019

Reindexing StormCrawler's URL Status Index

StormCrawler holds the frontier and the page fetch status in the "status" index. It's a sharded Elasticsearch index (in the most common setup) and every document contains the page URL, the fetch status, the time when the URL should be fetched, and further information as metadata. The index uses the SHA-256 of the URL as id. How URLs are distributed among shards is directly related to crawler politeness. A crawler needs to limit the load on a particular server. If all URLs of the same host or domain are kept in the same shard, the spouts (we run one spout per shard) can already ensure that only a limited number of URLs per host or domain is emitted into the topology during a certain time interval. A controlled flow of URLs supports the fetcher bolt, which ultimately enforces politeness and guarantees a minimum delay between successive requests sent to the same server.

Motivation

The status index grows over time. Sooner or later, unless you decide to delete it and start the crawler from scratch, you will hit one of the possible reasons to reindex it:
  1. change the number of shards (to allow the index to grow)
  2. fix domain names acting as sharding key (see the discussion in #684)
  3. strip metadata to save storage space
  4. apply current URL filters and normalization rules to all URLs in the status index
  5. remove the document type – previously the documents in the "status" index were of type "status". In Elasticsearch 7.0 document types are deprecated and consequently SC 1.14 also removed the document types in all indexes (status, metrics, content)
  6. merge/append indexes
  7. pull or push to another ES cluster
  8. ... (there are some more good reasons, for sure)

In my case it was a combination of points 1, 2, 3 and 5, applied to a status index holding 250 million URLs of Common Crawl's news crawler. At the same time, the server was upgraded and I had to move the index to a new machine (point 7). Enough reasons to try the reindexing topology introduced in StormCrawler 1.14 (#688). Of course, some of the points (but not all) could also be achieved with the standard Elasticsearch tools.

Configure the Reindex Topology

There are multiple possible scenarios: both Elasticsearch indexes are on the same machine or cluster (using index aliases), or you pull or push from/to a remote index. My decision was to run the reindex topology on the machine hosting the new index and pull the data from the old index. To easily get around the Elasticsearch authentication, I simply forwarded the remote ES port to the local machine by running ssh -R 9201:localhost:9200 <ip_new> -fN & on the old machine. The old index is now visible on port 9201 on the new machine.

In any case, you first need to upgrade Elasticsearch on the target and source machine/cluster to version 7.0, which is required by StormCrawler 1.14. In other words, both Elasticsearch indexes must be compatible with the Elasticsearch client used by the StormCrawler version running the reindexing topology.

The topology is easily configured as a Storm Flux file:

    name: "reindexer"
    
    includes:
      - resource: true
        file: "/crawler-default.yaml"
        override: false
    
      - resource: false
        file: "crawler-conf.yaml"
        override: false   # let the properties defined in the flux take precedence
    
      - resource: false
        file: "es-conf.yaml"
        override: false
    
    config:
      # topology settings
      topology.max.spout.pending: 600
      topology.workers: 1
      # old index (remote, forwarded to port 9201)
      es.status.addresses: "http://localhost:9201"
      es.status.index.name: "status"
      es.status.routing.fieldname: "metadata.hostname"
      es.status.concurrentRequests: 1
      # new index (local)
      es.status2.addresses: "localhost"
      es.status2.index.name: "status"
      es.status2.routing: true
      es.status2.routing.fieldname: "metadata.hostname"
      es.status2.bulkActions: 500
      es.status2.flushInterval: "1s"
      es.status2.concurrentRequests: 1
      es.status2.settings:
        cluster.name: "elasticsearch"
    
    spouts:
    #  - id: "filter"
    #    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    #    parallelism: 10
      - id: "spout"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.ScrollSpout"
        parallelism: 10   # must be equal to number of shards in the old index
    
    bolts:
      - id: "status"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
        parallelism: 4
        constructorArgs:
          - "status2"
    
    streams:
      - from: "spout"
    #   to: "filter"
    #   grouping:
    #     type: FIELDS
    #     args: ["url"]
    #     streamId: "status"
    # - from: "filter"
        to: "status"
        grouping:
          streamId: "status"
          type: CUSTOM
          customClass:
            className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
            constructorArgs:
              - "byDomain"

The parts needed to plug in a URL filter bolt are commented out. Of course, you should review all settings carefully so that they fit your situation. One important point is the number of spouts, which must be equal to the number of shards in the old index.

After some failed trials, I decided to choose defensive performance settings:
  • a not too high number of tuples pending in the topology: topology.max.spout.pending: 600. With 10 spouts, at most 6,000 pending tuples are allowed.
  • four status bolts without concurrent requests (es.status2.concurrentRequests: 1).

To speed up the reindexing you might also change the refresh interval of the new Elasticsearch index by calling:
    curl -H Content-Type:application/json -XPUT 'http://localhost:9200/status/_settings' \
        --data '{"index" : {"refresh_interval" : -1 }}'

Don't forget to set it back to the default once the reindexing is done:
    curl ... --data '{"index" : {"refresh_interval" : null }}'

Running the Topology

The reindex topology is started as Flux via:
    java -cp crawler-1.14.jar org.apache.storm.flux.Flux --remote reindex-flux.yaml

Depending on the size of your index, this might take a while. I achieved about 6,000 reindexed documents per second, which means the entire 250 million docs were reindexed after roughly 12 hours.

Verifying the New Status Index

The topology has finished; now let's check the new index and verify that everything has been reindexed. Let's first get the metrics of the old status index (on port 9201):
%> curl -H Content-Type:application/json -XGET 'http://localhost:9201/_cat/indices/status?v'
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   status xFN4EGHYRaGw8hi4qxMWrg  10   1  250305363      6038664      101gb          101gb
and compare them with those of the new status index:
%> curl -H Content-Type:application/json -XGET 'http://localhost:9200/_cat/indices/status?v'
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   status ZFNT8FovT4eTl3140iOUvw  16   1  250278257         5938     87.2gb         87.2gb
Great! The new index now has 16 shards, and 10 GB of storage have been saved by stripping unneeded metadata. All 250 million documents have been reindexed. But stop: it's not all documents – a few thousand are missing! Panic! What's going on? I checked the logs for errors – nothing. Also: why are there deleted documents in the new index?

Ok, let's first calculate the loss: 250305363 – 250278257 = 27106 (0.01%). Well, that is probably not worth redoing the procedure for; either the links are outdated or the crawler will find them again. Anyway, I was interested in figuring out the reason. But how to find the missing URLs? It's not trivial to compare two lists of 250 million items.

The solution is to first get the counts of items per domain by aggregating on the field "metadata.hostname", which is used for routing documents to shards. The idea is to find the domains where the counts differ and then compare only the per-domain lists. Let's do it:
curl -s -H Content-Type:application/json -XGET http://localhost:9200/status/_search --data '{
  "aggs" : {
    "agg" : {
      "terms" : {
        "field" : "metadata.hostname",
        "size" : 100000
      }
    }
  }
}' | jq --raw-output '.aggregations.agg.buckets[] | [(.doc_count|tostring),.key] | join("\t")' \
   | rev | sort | rev >domain_counts_new_index.txt
A short explanation of what this command does:
  1. we send an aggregation query to the Elasticsearch index which counts documents per domain name (kept in "metadata.hostname")
  2. the JSON output is processed by jq and the output is written into a text file with two columns – count and domain name
  3. the list is sorted using reversed strings – this keeps the counts together per top-level domain, domain, subdomain which makes the comparison easier
The procedure to get the per-domain counts from the old index is the same. After we have both lists we can compare them side by side:
%> diff -y domain_counts_old_index.txt domain_counts_new_index.txt
...
4       athletics.africa              | 80      athletics.africa
76      www.athletics.africa          <
108     chri.ca                         108     chri.ca
2       www.alta-frequenza.corsica    | 2       alta-frequenza.corsica
2       corsenetinfos.corsica         | 3       corsenetinfos.corsica
1       www.corsenetinfos.corsica     <
...
Already the first difference revealed the reason for the discrepancies between the old and new index! You remember, there was this issue with domain names changing over time with different versions of the public suffix list? It was one of the reasons why the reindexing topology was introduced (see #684 and news-crawler#28). If the routing key is not stable, it can happen that the same URL is indexed twice with different routing keys in two shards. Exactly this happened for some of the domains affected by this issue. Here is one example:
6       hr.de                |  19      hr.de
2       reportage.hr.de      <
1       daserste.hr.de       <
11      www.hr.de            <
In the new index there are only 19 items, although the sum of 6 + 2 + 1 + 11 is 20. I checked the remaining differences and the assumption proved true: only domain names with recently introduced TLDs (.africa was registered by ICANN in 2017) or misclassified suffixes (co.uk is a valid suffix but hr.de is not) are affected.

Everything is fine now and the only open point is to bring the new index into production and restart the crawl topology on the new machine!

Monday 13 May 2019

What's new in StormCrawler 1.14

StormCrawler 1.14 was released yesterday and as usual, contains loads of improvements and bugfixes.


This release contains a number of breaking changes, mostly related to the move to Elasticsearch 7. We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

  • crawler-commons 1.0 (#693)
  • okhttp 3.14.0 (#692)
  • guava 27.1 (#702)
  • icu4j 64.1 (#702)
  • httpclient 4.5.8 (#702)
  • Snakeyaml 1.24 (#702)
  • wiremock 2.22.0 (#702)
  • rometools 1.12.0 (#702)
  • Elasticsearch 7.0.0 (#708)

Core



  • Track how long a spout has been without any URLs in its buffer (#685)
  • Change ack mechanism for StatusUpdaterBolts (#689)
  • Robots URL filter to get instructions from cache only (#700)
  • Allow indexing under canonical URL if in the same domain, not just host (#703)
  • /bugfix/ URLs ending with a space are fetched over and over again (#704)
  • ParseFilter to normalise the mime-type of documents into simple values (#707)
  • Robot rules should check the cache in case of a redirection (#709)
  • /bugfix/ Fix the logic around sitemap = false (#710)
  • Reduce logging of exceptions in FetcherBolt (#719)

Elasticsearch



  • Asynchronous spouts (i.e. ES) can send queries after max delay since previous one ended (#683)
  • StatusUpdaterBolt to load config from non-default param names (#687)
  • Add a ScrollSpout to read all the documents from a shard (#688 and #690) - see in our guest post how this can be used to reindex a status index.
  • ES IndexerBolt : check success of batches before acking tuples (#647)
  • /bugfix/ URLs with content that breaks ES get refetched over and over again (#705)
  • /bugfix/ URLs without valid host name (and routing) stay DISCOVERED forever (#706)
  • /bugfix/ ESSeedInjector: no URLs injected because URL filter does not subscribe to status stream (#715)
  • MetricsConsumer to include topology ID in metrics (#714)

WARC

  • Generate WARC request records (#509)
  • WARC format improvements (#691)

Tika


  • Set mimetype whitelist for Tika Parser (#712)

*********

I will be running a workshop on StormCrawler next month at the Web Archiving Conference in Zagreb and will give a presentation jointly with Sebastian Nagel of CommonCrawl. I will come with loads of presents generously given by our friends at Elastic.


As usual, thanks to all contributors and users.

Happy crawling!