Tuesday 20 March 2018

What's new in StormCrawler 1.8

I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:


Dependency updates
Core
  • Add option to send only N bytes of text to indexers #476
  • BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522 (see the sketch below)
  • MemorySpout to generate tuples with DISCOVERED status #529
  • OKHttp configure type of proxy #530
  • Make the default for http.content.limit consistent (-1) #534
  • Track time spent in the FetcherBolt queues #535
  • Increase detect.charset.maxlength default value #537
  • FeedParserBolt: metadata added by parse filters not passed forward in topology #541
  • Use UTF-8 for input encoding of seeds (FileSpout) #542
  • Default URL filter: exclude localhost and private address spaces #543
  • URLStreamGrouping returns the taskIDs and not their index #547
WARC
  • Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520
SOLR
  • Schema for status index needs date type for nextFetchDate #544
  • SOLR indexer: use field type text for content field #545
Elasticsearch
  • AggregationSpout fails with default value of es.status.bucket.field == _routing #521
  • Move to Elasticsearch REST API #539
We recommend that all users move to this version, as it fixes several bugs (#541, #547) and adds some great new features. In particular, the move to the REST API for Elasticsearch makes that module future-proof as well as easier to configure; #535 and #543 are also well worth having.
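
To illustrate the IDN normalisation added by #522, the conversion to ASCII relies on the standard Punycode transformation, which the JDK itself exposes through java.net.IDN. Below is a minimal sketch of that transformation; it is an illustration, not the actual BasicURLNormalizer code:

    import java.net.IDN;

    public class IDNSketch {
        public static void main(String[] args) {
            // internationalised host name -> ASCII Compatible Encoding (Punycode)
            String ascii = IDN.toASCII("bücher.example");
            System.out.println(ascii); // prints "xn--bcher-kva.example"
        }
    }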

As usual, thanks to all contributors and users. Happy crawling!

Tuesday 28 November 2017

What's new in StormCrawler 1.7

Amazingly this is the 20th release of StormCrawler! Here are the main changes:

Dependency updates
  • crawler-commons 0.9 #513
Core
  • (bugfix) ParserBolts should use outlinks from parsefilters #498
  • LD_JSON parsefilter #501
  • okhttp: store request and response headers verbatim in metadata #506
  • (bugfix) okhttp protocol does not store headers in metadata #507
  • HTTP clients should handle http.accept.language and http.accept #499
  • Selenium protocol follows redirections #514
  • RemoteDriverProtocol needs multiple instances #505
  • SitemapParserBolt should force mime-type based on the clue #515
Elasticsearch
  • ES Spout : define filter query via config #502
  • Upgrade to ES 6.0 #517
We recommend that all users move to this version. If you wish to remain on an older version of Elasticsearch, you can simply keep your existing version of the StormCrawler Elasticsearch module while upgrading the core.

This version improves the processing of sitemaps, via #515 and the move to crawler-commons 0.9, where we fixed the SAX parsing and extended its coverage. We also improved our okhttp-based protocol implementation. If your crawl is a wide one, with potentially any sort of content, then you should choose okhttp over the default httpclient implementation; see our comparison of protocol implementations on the wiki.
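
To give an idea of what the okhttp-based protocol does under the hood, here is a minimal hand-rolled sketch using the okhttp3 API directly. It sets the Accept-Language header (#499) and keeps the response headers verbatim (#506); this is an illustration, not the actual protocol implementation:

    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.Response;

    public class OkHttpSketch {
        public static void main(String[] args) throws Exception {
            OkHttpClient client = new OkHttpClient.Builder()
                    .followRedirects(true)
                    .build();
            Request request = new Request.Builder()
                    .url("https://example.com/")
                    // equivalent of the http.accept.language configuration
                    .header("Accept-Language", "en-US,en;q=0.8")
                    .build();
            try (Response response = client.newCall(request).execute()) {
                // response headers kept verbatim, as per #506
                System.out.println(response.headers());
                byte[] content = response.body().bytes();
                System.out.println(content.length + " bytes fetched");
            }
        }
    }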

Finally, if you want to extract semantic data represented in ld-json then you'll love #501.
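
Conceptually, extracting embedded JSON-LD boils down to selecting the right script elements from the parsed document, which jsoup makes straightforward. A minimal sketch, not the actual parse filter:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class LdJsonSketch {
        public static void main(String[] args) {
            String html = "<html><head><script type=\"application/ld+json\">"
                    + "{\"@type\":\"Article\",\"headline\":\"Hello\"}</script></head></html>";
            Document doc = Jsoup.parse(html);
            for (Element script : doc.select("script[type=application/ld+json]")) {
                // raw JSON-LD payload, typically schema.org metadata
                System.out.println(script.data());
            }
        }
    }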

As usual, thanks to all contributors and users. Happy crawling!

Friday 8 September 2017

What's new in StormCrawler 1.6


Dependency updates

  • jsoup 1.10.3
  • crawler-commons 0.8

Core

  • Use ISO representation of time for modifiedtime in AdaptiveScheduler #496
  • Use ISO representation of time for discoveryDate and lastProcessedDate #477 (see the snippet after this list)
  • Improved Charset Detection #495
  • SitemapParserBolt can be configured to use SAX parsing or not
  • SitemapParserBolt generates metrics for average processing time
  • HTTP protocol based on OKHTTP #484 
  • Apache Http client can use HEAD method on a per URL basis #485
  • ContentFilter to leave trace of the pattern that matched #480
  • Metadata has a new public method for getting first non-empty value from a set of keys
  • Added ARTICLE to patterns for content filter
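
The ISO representation mentioned above (#496, #477) is the standard ISO-8601 instant format, which java.time produces out of the box. A quick illustration:

    import java.time.Instant;

    public class IsoTimeSketch {
        public static void main(String[] args) {
            // epoch milliseconds -> ISO-8601 instant, e.g. "2017-09-08T10:15:30.123Z"
            Instant now = Instant.ofEpochMilli(System.currentTimeMillis());
            System.out.println(now);
        }
    }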

LangID

  • Can add more than one lang code based on configurable prob threshold. #481

WARC

  • Added rotation policy based on time and file size (sketch below)
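
The rotation policies come from storm-hdfs, which the WARC module builds on. A minimal sketch of the two kinds of policy; the exact wiring into the WARC writer may differ:

    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
    import org.apache.storm.hdfs.bolt.rotation.TimedRotationPolicy;

    public class RotationSketch {
        public static void main(String[] args) {
            // rotate the WARC file once it reaches 1GB...
            FileSizeRotationPolicy bySize = new FileSizeRotationPolicy(1.0f, Units.GB);
            // ...or every hour, depending on which policy the writer is given
            TimedRotationPolicy byTime =
                    new TimedRotationPolicy(1.0f, TimedRotationPolicy.TimeUnit.HOURS);
        }
    }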

ES

  • ES: added es.status.reset.fetchdate.after #478
  • Removed Grafana resources - can be downloaded from Grafana portal

Monday 29 May 2017

What's new in StormCrawler 1.5

StormCrawler 1.5 has just been released! It is an important milestone, with the move to Elasticsearch 5.x and the implementation of long-awaited features such as the Selenium-based protocol. The code has been improved in many ways and, despite the seemingly low number of lines below, this new release is a mammoth one!

The project in general is in very good health, with more and more organisations using it in production and increased visibility, reflected by the growing number of questions on StackOverflow.

Here are the main changes in 1.5.

CORE DEPENDENCIES UPGRADES
  • Apache Storm 1.1.0 (#450)

CORE MODULE
  • HTTP Protocol: implement cookie handling (#32)
  • java.util.zip.ZipException: Not in GZIP format thrown on redirs with httpclient (#455)
  • Selenium-based protocol implementation (#144), which I described in a separate blog post (a sketch follows this list)
  • Indicate whether RobotsRules come from cache or have been fetched (#460)
  • Memory issues when ByteArrayBuffer gets instantiated with a large value despite maxLength being set (#462)
  • FetcherBolt to dump URLs being fetched to log (#464)
  • Override sitemapsAutoDiscovery settings per URL (#469)
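
To give a flavour of what the Selenium-based protocol does, the sketch below drives a remote browser and retrieves the rendered DOM. It is a simplified illustration, not the actual RemoteDriverProtocol code, and assumes a Selenium server listening on localhost:4444:

    import java.net.URL;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.remote.DesiredCapabilities;
    import org.openqa.selenium.remote.RemoteWebDriver;

    public class SeleniumSketch {
        public static void main(String[] args) throws Exception {
            // connect to a remote browser (the address is illustrative)
            WebDriver driver = new RemoteWebDriver(
                    new URL("http://localhost:4444/wd/hub"),
                    DesiredCapabilities.chrome());
            try {
                driver.get("https://example.com/");
                // the DOM after JavaScript execution, not the raw HTTP response
                System.out.println(driver.getPageSource());
            } finally {
                driver.quit();
            }
        }
    }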

Knowing whether the RobotsRules come from the cache gives us more insight into the behaviour of the crawlers, as we can display the ratio of cache vs live requests, as well as pages fetched vs robots fetched.

ELASTICSEARCH
  • Utility class to export URL and metadata from ES index to file (#444)
  • Fixed sampling with aggregation spout in ES5
  • Upgrade to Elasticsearch 5.3 (#221 and #451)
  • Optimise nextFetchDate to speed up queries to Elasticsearch (#429 and #452)
  • Delete gone pages from index (#253)
  • metrics - remove filtering (#281)

One of the main changes related to Elasticsearch is the removal of ElasticsearchSpout and the introduction of CollapsingSpout, which uses the brand new field collapsing feature in Elasticsearch (a sketch of such a query follows). We also fixed a concurrency issue in the StatusUpdaterBolt (9fefac8) and improved the efficiency of the spouts by getting them to process results in a separate thread (1b0fb42). Combined with the optimisation of nextFetchDate (see above) and the fix of the sampling in AggregationSpout, this means that the Elasticsearch module is more efficient than ever.
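
Field collapsing, available from Elasticsearch 5.3 onwards, returns only the top document per group directly at query time. Here is a rough sketch of such a query built with the Java API; the collapse field and size are illustrative, not necessarily what CollapsingSpout uses:

    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import org.elasticsearch.search.collapse.CollapseBuilder;

    public class CollapsingSketch {
        public static void main(String[] args) {
            // URLs due for fetching, keeping a single top document per bucket
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.rangeQuery("nextFetchDate").lte("now"))
                    .collapse(new CollapseBuilder("metadata.hostname"))
                    .size(100);
            System.out.println(source); // prints the JSON body of the query
        }
    }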

The move to Elasticsearch 5.x was not without difficulties but the result justifies the effort. I described in a separate post the common pitfalls of upgrading an existing topology to Elasticsearch 5.

Coming next?
As usual, it is hard to guess what the next release will be made of, as the project is driven by its community.

Having said that, I'd expect the Selenium-based protocol to improve as users start adopting it. It is also likely that we'll move away from the Apache HttpClient library (#443). As mentioned in the previous release, we'll probably upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser.

In the meantime and as usual, thanks to all contributors and users and happy crawling!

Friday 17 March 2017

Contribute to an open source project beyond code

I was recently contacted by someone who liked StormCrawler, wanted to contribute to it and asked me how to do so. While most contributions to open source projects take the form of code to either fix bugs or add new functionalities, there are various other ways in which people can contribute.

Here is what I replied to him, and while the examples below are about StormCrawler, the same ideas can apply to pretty much any open source project.

  • Spread the word: if you use StormCrawler, why not blog/tweet about it and get listed on the powered by page? The more people see that it is used, the more confident they become in adopting it. If you are not too shy: why not give a short presentation at a local tech meetup or a bigger conference?
  • Help with the documentation or write tutorials: we have WIKI pages and various instructions on the site - going through those would be a good way of learning about Apache Storm and StormCrawler while at the same time making a useful contribution.
  • Find bugs and possible improvements: run the code, benchmark it, look at the logs for unexpected things. Just play and see! If something is not clear, then the docs can be improved (see previous point).
  • Test things in branches / PRs: for instance, I started work on jBrowserDriver and Elasticsearch 5. Giving new functionalities an early try is fab.
  • Help others: you have used StormCrawler a bit? Join the mailing list or follow StackOverflow and help newcomers overcome the hurdles as you did.
  • Donate resources: your company has one or more servers they are not using (unlikely but who knows)? You have AWS credits and don't know what to do with them? We can always do with test machines.
Any of those forms of contribution is valuable! Writing code is good but that's just one part of making a project successful.

PS: if you are wondering what happened with that prospective contributor, he's taken one of the open issues and is doing great work on it!