Showing posts with label elasticsearch. Show all posts
Showing posts with label elasticsearch. Show all posts

Friday 25 May 2018

What's new in StormCrawler 1.9


Dependency upgrades
Core
  • Crawl-delay in robots.txt should optionally not shrink the configured delay #549
  • Optimisation: faster extraction of META tags #553
  • CollectionMetric synchronized access to List #555
  • Configurable Robots Caches #557
  • JSOUPParserBolt: lazy DOM conversion #563
  • Purge internal queues of tuples which have already reached timeout #564
  • Added ParseFilter to convert single valued Metadata to multi-valued ones #571
  • Caching of redirected robots.txt may overwrite correct robots.txt rules, fixes #573
WARC
  • WARCBolt to handle incorrect URIs gracefully #560
  • WARCRecordFormat use ByteBuffer instead of ByteArrayOutputStream #561
Archetype
  • Uses flux-core 1.2.1 #559
  • Added FeedParser to archetype topology #551
  • Added .kml and .wmv to url filters
SOLR
  • MetricsConsumer handles recursive values #554
Elasticsearch
  • MetricsConsumer handles recursive values #554
  • ES Indexer and Deletion Bolts to get index name from constructor #572
LanguageID
  • Added option to LanguageID to skip if metadata already set #570
As usual, we advise all users to move to this version as it fixes several bugs. Thanks to all contributors and users. Happy crawling!

Tuesday 20 March 2018

What's new in StormCrawler 1.8

I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:


Dependency updates
Core
  • Add option to send only N bytes of text to indexers #476
  • BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522
  • MemorySpout to generate tuples with DISCOVERED status #529
  • OKHttp configure type of proxy #530
  • http.content.limit inconsistent default to -1 #534
  • Track time spent in the FetcherBolt queues #535
  • Increase detect.charset.maxlength default value #537
  • FeedParserBolt: metadata added by parse filters not passed forward in topology #541
  • Use UTF-8 for input encoding of seeds (FileSpout) #542
  • Default URL filter: exclude localhost and private address spaces #543
  • URLStreamGrouping returns the taskIDs and not their index #547
WARC
  • Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520
SOLR
  • Schema for status index needs date type for nextFetchDate #544
  • SOLR indexer: use field type text for content field #545
Elasticsearch
  • AggregationSpout fails with default value of es.status.bucket.field == _routing #521
  • Move to Elasticsearch RESTAPi #539
We recommend all users to move to this version as it fixes several bugs (#541#547) and adds some great new features. In particular, the use of the REST API for Elasticsearch, which makes the module future-proof but also easier to configure, but also #535 and #543.

As usual, thanks to all contributors and users. Happy crawling!

Monday 29 May 2017

What's new in StormCrawler 1.5

StormCrawler 1.5 has just been released! It is an important road mark with the move to Elasticsearch 5.x and the implementation of long-awaited features such as the Selenium-based protocol. The code has been improved in many ways and despite the seemingly low number of lines below, this new release is a mammoth one!

The project, in general, is in very good health, with more and more organisations using it in production, and an increased visibility, reflected by the growing number of questions on StackOverflow.

Here are the main changes in 1.5.

CORE DEPENDENCIES UPGRADES
  • Apache Storm 1.1.0 (#450)

CORE MODULE
  • HTTP Protocol: implement cookie handling (#32)
  • java.util.zip.ZipException: Not in GZIP format thrown on redirs with httpclient (#455)
  • Selenium-based protocol implementation (#144) which I described in a separate blog post
  • Indicate whether RobotsRules come from cache or have been fetched (#460)
  • Memory issues when ByteArrayBuffer gets instantiated with a large value despite maxLength being set (#462)
  • FetcherBolt to dump URLs being fetched to log (#464)
  • Override sitemapsAutoDiscovery settings per URL (#469)

Knowing whether RobotsRules come from the cache gives us more insights into the behaviour of the crawlers as we can display the ratio of cache vs live (see illustration below)


as well as pages fetched vs robots fetched.

  

ELASTICSEARCH
  • Utility class to export URL and metadata from ES index to file (#444)
  • Fixed sampling with aggregation spout in ES5
  • Upgrade to Elasticsearch 5.3 (#221 and  #451)
  • Optimise nextFetchDate to speed up queries to Elasticsearch (#429 and  #452)
  • Delete gone pages from index (#253)
  • metrics - remove filtering (#281)

One of the main changes related to Elasticsearch is the removal of ElasticsearchSpout and the introduction of CollapsingSpout, which uses the brand new FieldCollapsing in Elasticsearch. We also fixed a concurrency issue in the StatusUpdaterBolt (9fefac8), improved the efficiency of the spouts by getting them to process results in a separate thread (1b0fb42), which combined with the optimisation of nextFetchDate (see above) and the fix of the sampling in AggregationSpout, means that the Elasticsearch module is more efficient than ever.

The move to Elasticsearch 5.x was not without difficulties but the result justifies the effort. I described in a separate post the common pitfalls of upgrading an existing topology to Elasticsearch 5.

Coming next?
As usual, it is hard to guess what the next release will be made of as the project is driven by its community.

Having said that, I'd expect the Selenium-based protocol to get improved as users start to use it. It is also likely that we'll move away from Apache HttpClient library (#443). As mentioned in the previous release, we'll probably upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser.

In the meantime and as usual, thanks to all contributors and users and happy crawling!




Monday 15 May 2017

Avoid these common pitfalls when upgrading StormCrawler with Elasticsearch 5.x

The next (and probably imminent) release of StormCrawler will contain an update of Elasticsearch to version 5.3. This is definitely a good thing, as we want to keep up with the latest versions of Elasticsearch but has a few pitfalls when upgrading your existing application. Some of the changes are documented in the README but I will reiterate them here, just in case.

LOG4J dependencies
ES5 requires an upgrade in the logging dependencies of Apache Storm. You can update the dependencies in your existing Storm cluster by hand but since my patch is part of Storm 1.1.0, you should probably upgrade Storm altogether. StormCrawler 1.5 will depend on Storm 1.1.0 (but probably works with older versions as well).

Maven Shade Configuration
The pom file of your StormCrawler-based project needs modifying as well, you'll need to specify the Maven Shade Configuration and include:
<manifestEntries>
 <Change></Change>
 <Build-Date></Build-Date>
</manifestEntries>

See https://github.com/elastic/elasticsearch/issues/21627; this wasn't an issue with the previous versions of Elasticsearch.

Update es-conf.yaml
In particular, the value of es.status.bucket.field used to be _routing, which is an automatically generated field, however this is not available for the spouts anymore. Instead, use the same value as es.status.routing.fieldname e.g. metadata.hostname

Mapping
ES5 should be able to read your existing indices, however, if you create a new set of indices from scratch, make sure you use the latest version of the script.

I hope this will help you for a successful upgrade, I will cover the new functionalities and improvements coming with StormCrawler 1.5 when it is released.

Happy crawling

Tuesday 4 April 2017

Video Tutorial - StormCrawler + Elasticsearch + Kibana

This tutorial explains how to configure Elasticsearch with StormCrawler. 

We first bootstrap a StormCrawler project using the Maven archetype, have a look at the resources and code generated, then modify the project so that it uses Elasticsearch. We then run an injection topology and the crawl topology before setting up Kibana for monitoring the metrics and content of the status index.



(with my apologies for the quality of the sound)

Enjoy

Julien

Wednesday 29 March 2017

Full day workshop(s) on StormCrawler (+Elasticsearch and Kibana)


I will be running a full-day workshop on crawling with StormCrawler on the 24th April in Berlin. See full details on https://endoctus.com/course/web-crawling-with-stormcrawler.

Please find the program below:

In this workshop, we will explore StormCrawler a collection of resources for building low-latency, large scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what Storm-Crawler provides, we'll put it to use straight away for a simple crawl before moving on to the deployed mode of Storm

In the second part of the session, we will then introduce metrics and index documents with Elasticsearch and Kibana and dive into data extraction. Finally, we'll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.  

This course will suit Java developers with an interest in big data, stream processing, web crawling and search. It will provide a practical introduction to both Apache Storm and Elasticsearch as well of course as StormCrawler and should not require advanced programming skills. 

Duration : 2x3 hours 


PS: Do you follow DigitalPebble or StormCrawler on Twitter? Announcements and updates are made there (as well as all sorts of interesting news of course!) 

Thursday 23 March 2017

What’s new in StormCrawler 1.4

StormCrawler 1.4 has just been released! As usual, all users are advised to upgrade to this version as it fixes some bugs and contains quite a few new functionalities.

Core dependencies upgrades

  • Httpclient 4.5.3
  • Storm 1.0.3 #437

Core module

  • JSoupParser does not dedup outlinks properly, #375
  • Custom schedule based on metadata for non-success pages, #386
  • Adaptive fetch scheduler #407
  • Sitemap: increased default offset for guessing + made it configurable  #409
  • Added URLFilterBolt + use it in ESSeedInjector #421
  • URLStreamGrouping 425
  • Better handling of redirections for HTTP robots #4372d16
  • HTTP Proxy over Basic Authentication #432
  • Improved metrics for status updater cache (hits and misses) #434
  • File protocol implementation #436
  • Added CollectionMetrics (used in ES MetricsConsumer + ES Spout, see below) #7d35acb

AWS

  • Added code for caching and retrieving content from AWS S3 #e16b66ef

SOLR

  • Basic upgrade to Solr 6.4.1
  • Use ConcurrentUpdateSolrClient; #183

Elasticsearch

  • Various changes to StatusUpdaterBolt
    Fixed bugs introduced in 1.3 (use of SHA ID), synchronisation issues, better logging, optimisation of docs sent and more robust handling of tuples waiting to be acked (#426). The most important change is a bug fix whereby the cache was never hit (#442) which had a large impact on performance.
  • Simplified README + removed bigjar profile from pom #414
  • Provide basic mapping for doc index #433
  • Simple Grafana dashboard for SC metrics, #380
  • Generate metrics about status counts, #389
  • Spouts report time taken by queries using CollectionMetric, #439 - as illustrated below
Spout query times displayed by Grafana
(illustrating the impact of SamplerAggregationSpout on a large status index )

Coming next?

As usual, it is not clear what the next release will contain but hopefully, we'll switch to Elasticsearch 5 (you can already take it from the branch es5.3) and provide resources for Selenium (see branch jBrowserDriver). As I pointed out in my previous post, getting early feedback on work in progress is a great way of contributing to the project.

We'll probably also upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser. We might move to one of the next releases of Apache Storm, where a recent contribution I made will make it possible to use Elasticsearch 5. Also, some of our StormCrawler code has been donated to Storm, which is great!

In the meantime and as usual, thanks to all contributors and users and happy crawling!

PS: I will be running a workshop in Berlin next month about StormCrawler, Storm in general and Elasticsearch


Tuesday 10 January 2017

What's new in StormCrawler 1.3


StormCrawler 1.3 has just been released! As usual, all users are advised to upgrade to this version as it fixes some bugs and contains quite a few new functionalities and improved performance. 

Dependencies upgrades

  • Jsoup 1.10.1
  • Crawler-Commons 0.7
  • RomeTools to 1.7.0
  • ICU4J 58.2

Core module

  • Hardcoded limit to the max # connections allowed by protocol #388
  • LangID module #364
  • JsoupParserBolt can use first N bytes for charset detection (or not at all) #391
  • SimpleFetcherBolt uses allowRedir from super class #394 (bugfix)
  • URLNormalizer : Decode non-standard percent encoding prior to re-encoding
  • MaxDepthFilter defaults to -1, 0 removes all outlinks, can set a custom max depth per URL with max.depth. Implements #399 and #400

The latter breaks compatibility with the previous versions: 0 was used to deactivate the filtering by depth, whereas now it is used to prevent any outlinks from being processed. Please change your config to -1 if you want to deactivate the filtering.

Elasticsearch

  • Flux for crawl and injection topologies #372
  • Use min delay for all types of Spouts #370
  • Remove Node client #377
  • ESSpout deals with deep paging before building query
  • Topology status updater triaged by URL to hit cache
  • Settings done via configuration #376
  • Add plugin to the clients via configuration #378
  • Spouts: load results with a non-blocking call #371
  • Concurrent requests in config #382
  • StatusUpdaterBolt - do not add URL already in buffer for ES if status is DISCOVERED
  • Allow fieldNameForRoutingKey to be outside metadata and use a different key for spouts #384
  • Use SHA256 as doc_id #385
  • Separate Kibana schema for status and metrics + put all schemas in a separate folder
  • Improvements to ES_IndexInit
  • ES crawl topology uses FetcherBolt
Please note that the cluster name is now defined alongside the other settings:
  es.status.settings:
    cluster.name: "elasticsearch"
One of the benefits of #376 and  #378 is that you can now use StormCrawler with Elastic Cloud protected with Shield.

We are fast approaching our 1.000th commit! Thanks to all users and contributors for their help with StormCrawler. Happy crawling!

PS: I will be running a 1-day workshop in Berlin on the 2nd of February. Announcements will be made on our Twitter account


Tuesday 3 January 2017

The Battle of the Crawlers : Apache Nutch vs StormCrawler

Happy New Year everyone!

For this first blog post of 2017, we'll compare the performance of StormCrawler and Apache Nutch. As you probably know, these are open source solutions for distributed web-crawling and we provided an overview last year of both as well as a performance comparison when crawling a single website.

StormCrawler has been steadily gaining in popularity over the last 18 months and a frequent question asked by prospective users is how fast it is compared to Nutch. Last year's blog post provided some insights into this but now we'll go one step further by crawling not a single website, but a thousand. The benchmark will still be on a single server though but will cover multi-million pages.

Disclaimer: I am a committer on Apache Nutch and the author of StormCrawler.

Meet the contestants

Please have a look at our previous blog post for a more detailed description of both projects. This Q and A should also be useful. 

Apache Nutch is a well-established web crawler based on Apache Hadoop. As such, it operated by batches with the various aspects of web crawling done as separate steps (e.g. generate a list of URLs to fetch, fetch, parse the web pages and update its data structures.

In this benchmark, we'll use the 1.x version of Nutch. There is a 2.x branch but as we saw in a previous benchmark, it is a lot slower. It also lacks some of the functionalities of 1.x and is not actively maintained.

StormCrawler, on the other hand, is based on Apache Storm, a distributed stream processing platform. All the web crawling operations are done continuously and at the same time.

What we can assume (and observed previously) is that StormCrawler should be more efficient as Nutch does not fetch web pages continuously, but only as one of the various batch steps. On top of that, some of its operations - mainly the ones that deal with the crawldb, the datastructure used by Nutch - take increasingly longer as the size of the crawl grows.


The battleground

We ran the benchmark on a dedicated server provided by OVH with the following specs :

Intel  Xeon E5 E5 1630v3 4c/8t  3,7 / 3,8 GHz
64 GB of RAM DDR4 ECC 2133 MHz
2x480GB RAID 0 SSD

Ubuntu 16.10 server

We installed the following software

Apache Storm 1.0.2
Elasticsearch 2.4
Kibana 4.6.3

StormCrawler 1.3-SNAPSHOT

Hadoop 2.7.3
Apache Nutch 1.13-SNAPSHOT

Finally, the resources and configurations for the benchmark can be found on the first release of  https://github.com/DigitalPebble/stormcrawlerfight.

We followed the recommendations from 
for the configuration of Elasticsearch on SSD and gave it 10GB RAM to run on.

Apache Nutch 

The configuration for Nutch can be found in the GitHub repo under the nutch directory. This should allow you to reproduce the benchmarks if you wished to do so.

The main changes to the crawl script, apart from the addition of a contribution I recently made to Nutch, was to : 

  • set the number of fetch threads to 500
  • change the max size of the fetchlist to 50,000,000
  • use 4 reducer tasks
  • remove the link inversion and dedup steps

The latter was done in order to keep the crawl to a minimum. We left the setting for the limitation of fetch time to 3 hours. The aim of this was to avoid long tails in the fetching step, where the process is busy fetching from only a handful of slow servers. 

In order to optimise the crawl, we limited the number of URLs per hostname in the fetchlist to 100, which guarantees a good distribution of URLs and again, prevents the long tail phenomenon, which is commonly observed with Nutch. We also tried to avoid the conundrum whereby setting too low a duration for the fetching step requires more crawl iterations, meaning that more generate and update steps are necessary.


Note: we initially intended to index the documents into Elasticsearch, however, this step proved unreliable and caused errors with Hadoop. We ended up deactivating the indexing step from the script, which should benefit Nutch when comparing to StormCrawler.

We ran 10 crawl iterations between 2016.12.16 11:18:37 CET and 2016.12.17 19:29:30 CET, the breakdown of times per step is as follows : 



Iteration #StepsTime
1
Generation0:00:38
Fetcher0:00:45
Parse0:00:26
Update0:00:24
2
Generation0:00:40
Fetcher0:23:26
Parse0:01:53
Update0:00:30
3
Generation0:00:52
Fetcher0:55:46
Parse0:08:24
Update0:01:07
4
Generation0:01:15
Fetcher1:08:36
Parse0:19:00
Update0:02:01
5
Generation0:02:03
Fetcher2:14:20
Parse0:47:59
Update0:04:34
6
Generation0:03:55
Fetcher4:20:02
Parse1:30:58
Update0:08:46
7
Generation0:06:44
Fetcher3:52:36
Parse1:09:40
Update0:08:15
8
Generation0:08:31
Fetcher3:48:35
Parse1:04:27
Update0:08:32
9
Generation0:10:00
Fetcher3:51:57
Parse1:18:38
Update0:09:27
10
Generation0:11:44
Fetcher3:32:35
Parse0:59:13
Update0:09:39
33:08:53

What you can observe is that the generate and update steps do take an increasingly longer time, as mentioned above.

The final stats from the final update step were :


db_fetched
10,626,298
db_gone
686,834
db_redir_perm
123,087
db_redir_temp
217,191
db_unfetched
64,678,627


which gives us a total of 11,653,410 URLs processed (fetch + gone + redirs) in a total time of 1930 minutes.



On average, Nutch fetched 6,038 URLs per minute.

The graph below shows the bandwidth usage of the server when the Nutch crawl was running.

Network graph of Nutch crawl

This is a good illustration of the batch nature of Nutch, where the fetching is only one part of the whole process.

Let's now see how StormCrawler fared in a similar situation.

StormCrawler

StormCrawler can use different backends for storing the status of the URLs (i.e. which is what the crawldb does in Nutch). For this benchmark, we used the Elasticsearch module of StormCrawler as it is the most commonly used. This means that we won't just be storing the content of the webpages to Elasticsearch, we'll also be using it to store the status of the URLs as well as displaying metrics about the crawl with Kibana.

We ran the crawl for over 2 and a half days and got the following values in the status index



DISCOVERED
188,396,525
FETCHED
32,656,149
ERROR
2,901,502
REDIRECTION
2,050,757
FETCH_ERROR
1,335,437
which means a total of 38,943,845 webpages processed over 3977minutes, i.e. an average of 9792.26 pages per min.

The network graph looked like this:

Network graph of StormCrawler crawl

which, apart from an unexplained and possibly unrelated spike on Christmas day, shows a pretty solid use of the bandwidth. Whereas Nutch was often around the 50M mark, StormCrawler is lower but constant.

The metrics stored in Elasticsearch and displayed with Kibana gave a similar impression:

StormCrawler metrics displayed with Kibana
Interestingly, the Storm UI indicated that the bottleneck of the pipeline was the update step, which is not unusual given the 'write-heavy' nature of StormCrawler.

Conclusion


This benchmark as set out above shows that StormCrawler is 60% more efficient than Apache Nutch. We also found StormCrawler to run more reliably than Nutch but this could be due to a misconfiguration of Apache Hadoop on the test server. We had to omit the indexing step from the Nutch crawl script because of reliability issues, whereas the StormCrawler topology did index the documents successfully. This would have added to the processing time of Nutch.



The main explanation lies in the design of the crawlers: Nutch achieves greater spikes in the fetching step but does not fetching continuously as StormCrawler does. I had compared Nutch to a sumo and StormCrawler to a ninja previously but it seems that the tortoise and hare parable would be just as appropriate.


It is important to bear in mind that the raw performance of the crawlers is just one aspect of an overall comparison. One should also consider the frequency of releases and contributions as well as more subjective aspects such as the ease of use and versatility. There is also of course the question of the functionalities provided. To be fair to Nutch, it currently does thing that StormCrawler does not yet support such as document deduplication and scoring. On the other hand, StormCrawler too has a few aces up its sleeve with Xpath extraction, sitemap processing and live monitoring with Kibana.


As often said in similar situations: “your mileage might vary”. The figures given here depend on the particular seed list and hardware, you might get different results on your specific use case. The resources and configurations of the benchmark being publicly available, you can reproduce it and extend it as you wish.


Hopefully we’ll run more benchmarks in the future. These could cover larger scale crawling in fully distributed mode and/or comparing different backends for StormCrawler (e.g. Redis+RabbitMQ vs Elasticsearch).


Happy crawling!