
Monday 31 October 2016

What's new in StormCrawler 1.2

StormCrawler 1.2 has been released today after a busy and exciting month, the highlight of which was certainly the announcement by CommonCrawl of the news dataset, which is powered by StormCrawler. This helped raise the profile of the project and also brought various improvements to the WARC and Elasticsearch modules (see below). Another piece of great news was that my talk was accepted for ApacheCon BigData next month in Seville, which prompted a Q&A interview on Linux.com.

Back to the content of the release. There have been many improvements at various levels, the main one being that the WARC module has been moved to the main repository [#313]. It has received many bugfixes and improvements since being used by CommonCrawl and is now stable enough to join the other external modules.

We recommend that all users upgrade their configuration to version 1.2 of StormCrawler.

Apart from minor bug fixes, the main changes in this new version are:


Core


  • Removed StatusStreamBolt [#341]
  • New Parse Filters:
    • MD5 signature [#354]
    • DomainParseFilter [#356]
  • URL Filters:
    • URL normalisation: remove parameters where the value is a 32-bit hash [#363]
    • Filtering: treat path parameters as query parameters [#366]
    • BasicURLFilter to remove URLs based on path repetition and max length [#368]
  • Added metadata.discoveryDate field to enable tracking of the discovery rate [#360]
  • Added a super class for bolts using the status stream [#353]
  • JSoup: handle redirections via meta tag [#350]

Tika

  • Upgraded to Tika 1.13 [#285]
  • Combine JSoupParser with Tika [#357]
  • Tika parser can now parse embedded documents [#358]
Elasticsearch

  • Elasticsearch upgraded to 2.4.1
  • Metadata keys with multiple values not indexed correctly in ES [#345]
  • Refactoring into AbstractSpout for ES [#348]
  • Status index - fields stored unnecessarily [#351]
  • Cache URLs post ack/fail [#349]
The latter is a substantial change to the way the Elasticsearch spouts work. All three flavours of spout hold a cache of the URLs being processed and use it to make sure that any URL returned by a query is not added twice. This worked OK but did not cater for situations where a URL was towards the bottom of the buffer and was acked or failed not long before the buffer was refilled from ES. In such cases, the changes to the status index had not yet been committed to the underlying index and, as a result, the same URL was returned by the next query. This caused 10 to 15% of URLs to be unnecessarily re-fetched within a short delay. With #349, URLs are kept in the cache for an extra N seconds after being acked or failed, to give the changes time to be reflected in the search results (this is of course configurable via es.status.ttl.purgatory).
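To make the idea concrete, here is a minimal, self-contained sketch of such a "purgatory" cache. This is not the actual spout code, just an illustration of keeping acked/failed URLs around for a configurable number of seconds so that stale results returned by the next query can be skipped.

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative cache: URLs stay 'in purgatory' for ttlSec after being acked/failed.
public class PurgatoryCache {

    private final Map<String, Long> expiry = new ConcurrentHashMap<String, Long>();
    private final long ttlMillis;

    public PurgatoryCache(int ttlSec) {
        this.ttlMillis = ttlSec * 1000L;
    }

    // called when a URL is acked or failed
    public void onAckOrFail(String url) {
        expiry.put(url, System.currentTimeMillis() + ttlMillis);
    }

    // true if the URL was recently acked/failed and should not be re-emitted yet
    public boolean inPurgatory(String url) {
        purgeExpired();
        return expiry.containsKey(url);
    }

    private void purgeExpired() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, Long>> it = expiry.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < now) {
                it.remove();
            }
        }
    }
}

A spout would then call inPurgatory(url) for every hit returned by the status query and simply skip the URL if it returns true.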

Coming next?

Releases seem to be coming more and more frequently. It is not yet clear what the next one will have in store, but I am sure the discussions at ApacheCon as well as the constant stream of new users will provide new functionality and bugfixes.

In the meantime and as usual, thanks to all contributors and users and happy crawling!


Monday 19 September 2016

What's new in StormCrawler 1.1

The 1.1 release comes two months after the previous one and is relatively lightweight by comparison. The main changes are:


Dependency upgrades

Jackson Databind (2.6.6) and Apache Storm (1.0.2)

Core

  • HTTP protocol: store response headers verbatim in metadata (#317) - used by the WARC module
  • FetcherBolt: added an option to throttle based on the number of URLs in the queues (#311)
  • Conventional 'never-refetch' date for nextFetchDate (#331)
  • Added metadata.lastProcessedDate
  • Deprecated StatusStreamBolt and copied it as DummyIndexer

Elasticsearch

  • Bugfix Flush BulkProcessor before closing connection (#320)
  • Added per day / month metrics consumer (#327)

Archetype

  • Generate the real jar name in the README
  • Proper handling of user-provided package names (#326)
  • POMs for projects generated from the archetype should use Java 8 (#325)

There have been several minor changes as well. 

Remember that you can get regular updates about major commits on the project by following us on Twitter @stormcrawlerapi.

BTW there should be an exciting announcement in the next couple of weeks about a cool use of StormCrawler by a high-profile user, watch this space!

As usual, thanks to all contributors and users and happy crawling!

PS: If you are near Bristol, you might be interested in coming to the talk I'll be giving at Bristech on the 6th of October.

NOTE

A patch release, 1.1.1, was published on the 21st of September and includes #335 (thanks to Jeff Bolle for pointing it out).





Tuesday 12 July 2016

StormCrawler : the Coming of Age

I am very happy to announce the release of StormCrawler 1.0. It has taken a few years (and more specifically 791 commits from 15 contributors and 10 releases) to evolve from what was just an intuition to a piece of software which is now mature, stable and used in production by various companies.

The major release number reflects the version of Apache Storm, as we switched from Storm 0.10 to 1.0; however, our minor number will not necessarily track the one used in Storm. The move to 1.0 also reflects the maturity of StormCrawler.

The main changes compared to the previous release are :
  • Moved to Storm 1.x (#295)
  • Upgraded to Java 8 (#308)
  • Renamed packages storm.crawler to stormcrawler (#306)
  • Added a Flux equivalent to the example topology class (#286)
  • FetcherBolt: simplified access to OutputCollector (#278)
  • JSoupParser detects the mimetype with Tika (#303)
  • Elasticsearch: removed TTL from the metrics index (#296)
  • Elasticsearch: sampler aggregation spout (#305)
  • Tika: provide clues to the Tika parser for identification of the mimetype (#302)
  • Use the metadata keys last-modified and etag (#109)
  • URLFilter based on metadata (#312)
plus several minor changes and bug fixes.

Let's have a closer look at some of the changes above.

Flux

Flux is a very elegant resource for defining and deploying topologies on Apache Storm. The simple crawl topology generated by the archetype now contains a Flux equivalent of the Java topology class. This means that you don't need to know Java to define a topology but also that you don't need to recompile the jar every time you make a small change to the topology.

After calling 'mvn clean package', you can start the topology in local mode with:

storm jar target/<INSERTJARNAMEHERE>.jar  org.apache.storm.flux.Flux --local crawler.flux

Sampler aggregation spout 

We added a new type of spout to the Elasticsearch module which uses the sampler aggregation, a feature introduced in Elasticsearch 2.x. This spout is useful when the status index is very large, as it reduces the time taken by the queries while preserving the diversity of URLs.
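As an illustration of the underlying idea, here is a rough sketch of a sampler-based query with the Elasticsearch 2.x Java client; the index and field names ("status", "nextFetchDate") and the sizes are assumptions for the example, not the spout's actual query.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;

public class SamplerQueryExample {
    // returns a diverse sample of due URLs instead of the top N of a full scan
    public static SearchResponse sampleDueURLs(Client client) {
        return client.prepareSearch("status")
                .setQuery(QueryBuilders.rangeQuery("nextFetchDate")
                        .lte(System.currentTimeMillis()))
                .setSize(0)
                .addAggregation(AggregationBuilders.sampler("sample")
                        .shardSize(100)
                        .subAggregation(AggregationBuilders.topHits("docs").setSize(10)))
                .execute().actionGet();
    }
}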

URL filter based on metadata

A new configurable URL filter based on metadata has been added and is included in the default topology generated by the archetype. This filter is ridiculously simple: it removes outlinks based on the metadata of the source document. Imagine for instance that we get URLs from sitemap files for a given site. We could decide not to follow the outlinks found in the leaf documents from the sitemap, which is a reasonable thing to do: if a site tells you what to index, there is a possibility that you would only get noise / variants / duplicates by following the outlinks. Since leaf documents get the feature isSitemap with the value false, we can configure the URL filter as follows:

{ "class": "com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter",
"name": "MetadataFilter",
"params": {
"isSitemap": "false"
}
}
This mechanism can be used for other things of course.

What next?

The next release will probably contain code and resources for fetching with the Selenium protocol or JbrowserDriver (#144). We might also improve the WARC related code. As usual the project evolves with the needs and contributions of the community.

Thanks

Since StormCrawler has just passed a major milestone, it is a good time to thank all the committers, contributors past and present, and users for helping make the project what it is today. I've had some very positive feedback recently from new users and I hope some of you will take the time to share your experiences with the rest of the community.

Happy crawling!

Julien





Friday 10 June 2016

What's new in StormCrawler 0.10?



Version 0.10 of Storm-Crawler has just been released. It contains many improvements and bugfixes and we recommend that all existing users upgrade to it.

Here are the main changes:

Core

Apart from the usual dependency upgrades (Apache Tika 1.12, Jsoup 1.8.3) and various bugfixes (notably #280 and #293), we completely removed the configuration files (parse and URL filters) from the core module (#227), as these should be specified by the user or provided by the archetype. This fixes an old issue whereby user files could be overwritten by the ones provided by the core jar when generating the big jar.

We also added a mechanism to allow custom scheduling of URLs based on their metadata (#283). This applies to fetched documents only for the time being and can be used to revisit some pages more frequently depending on their nature, for instance news feeds (see below).

It is now possible to configure a list of metadata keys to persist (metadata.persist) into the status storage but not transfer to the outlinks of a document (#293). This is useful for the _redirTo values that we now generate to track the target of redirections (#96). Until now these would have been passed on to the outlinks, which would have been wrong.
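As a rough illustration of the two configuration knobs just described, here is how the corresponding entries might be set programmatically. The key syntax (fetchInterval.<key>=<value>, with the value in minutes) and the values are assumptions for the example, and in a real crawl these entries would normally live in the topology's YAML configuration.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CrawlConfigExample {
    public static Map<String, Object> overrides() {
        Map<String, Object> conf = new HashMap<String, Object>();
        // custom scheduling (#283): revisit documents whose metadata contains
        // isFeed=true every 10 minutes (hypothetical key/value)
        conf.put("fetchInterval.isFeed=true", 10);
        // persist _redirTo in the status storage without transferring
        // it to the outlinks (#293 / #96)
        conf.put("metadata.persist", Arrays.asList("_redirTo"));
        return conf;
    }
}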

The JSoupParser now has the option of passing documents it cannot handle on to the default stream, so that another bolt can try to deal with them (#266). This can be useful, for example, to chain the JSoup parser, which generates a good DOM for XPath extraction, with the Tika one, which gives a rubbish DOM but can handle all sorts of file formats. JSoupParser also generates more complete DOMs, including inline Javascript and CSS nodes (#219).
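A minimal sketch of how such a chain could be wired in a topology is shown below; the component names ("fetch", "jsoup", "tika") and the pre-1.0 package names are assumptions for the example.

import backtype.storm.topology.TopologyBuilder;

import com.digitalpebble.storm.crawler.bolt.JSoupParserBolt;
import com.digitalpebble.storm.crawler.tika.ParserBolt;

public class ChainedParsersExample {
    public static void chainParsers(TopologyBuilder builder) {
        // JSoup handles HTML; anything it cannot parse is passed on
        // unchanged on the default stream (#266)
        builder.setBolt("jsoup", new JSoupParserBolt())
               .localOrShuffleGrouping("fetch");
        // the Tika parser picks up whatever JSoup passed through
        // (PDF, Office documents, ...)
        builder.setBolt("tika", new ParserBolt())
               .localOrShuffleGrouping("jsoup");
    }
}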

New resources

LinkParseFilter takes XPath expressions in the parsefilter configuration and allows the matching elements to be added to the outlinks.

FeedParserBolt uses the ROME library to process news feeds and generate the outlinks accordingly.

There is a separate repository for resources to generate WARC files. These might be moved to the storm-crawler repository later on.

Elasticsearch

There has been loads of work done in this module. The main thing is that we upgraded the code to the current version of Elasticsearch, 2.3.1 (#275), which should provide better performance and new functionality. The Kibana dashboards have also been improved and can display generic metrics for Apache Storm (receive queues, memory heaps), top hosts/domains on the status dashboard (see below) and the total bytes fetched on the crawl metrics dashboard.

Kibana dashboard showing the status counts and top hostnames

What's next?

There is already a branch to move the code to Storm 1.x along with some related improvements (#278). We will keep improving the existing components, in particular adding sampling to the AggregationSpout for Elasticsearch.

Thanks to all users and contributors who helped with this release. Remember that you can follow the project on Twitter @stormcrawlerapi.




Wednesday 4 November 2015

What's new in Storm-Crawler 0.7

Storm-Crawler 0.7 was released yesterday. This release fixes some bugs and provides numerous improvements; we advise users to upgrade to it. Here are the main changes:

  • AbstractIndexingBolt to use status stream in declareOutputFields #190
  • Change Status to ERROR when FETCH_ERROR above threshold #202
  • FetcherBolt tracks cause of error in metadata
  • Add default config file in resources #193
  • FileSpout chokes on very large files #196
  • Use Maven-Shade everywhere #199
  • Ack tick tuples #194
  • Remove PrinterBolt and IndexerBolt, added StdOutStatusUpdater #187
  • Upgraded Tika to 1.11

This release contains many improvements to the Elasticsearch module:


  • Added a README with a getting started section
  • IndexerBolt uses the URL as the doc ID
  • ESSpout: maxSecSinceQueriedDate param to avoid deep paging
  • ElasticSearchSpout can sort randomly -> better diversity of URLs
  • ElasticSearchSpout implements de/activate, a counter for time spent querying, and a configurable result size
  • Simple Kibana dashboards for the metrics and status indices
  • Metadata as a structured object (#197)
  • ES Spout: more metrics (acked, failed, ES queries and docs)
  • ESSeedInjector topology
  • Index init script uses TTL for metrics
  • Upgraded ES version to 1.7.2

The SOLR module has also received some attention:
  • solr-metadata (#210)
  • Cleaned up some documentation and fixed typos
  • Removed outdated configuration options from the SOLR module
We also improved the metrics by adding a PerSecondReducer (#209), which is used by the FetcherBolts to provide page-per-second and byte-per-second metrics. The metric names and codes were also improved, notably the gauges for the ESSpout and FetcherBolt.

These changes, combined with the Kibana dashboard templates, make it easy to monitor a crawl and get additional insights into its behaviour, as illustrated below.



Of course thanks to Storm's pluggable and versatile metrics mechanism, it is relatively easy to send metrics to other backends such as AWS Cloudwatch for instance.

Thanks to the various users and contributors who helped with this release.

Friday 5 June 2015

What's new in Storm-Crawler 0.5



We've just released version 0.5 of Storm-Crawler, just over three months after the previous one. As you can read below, we've been pretty busy! The project got some great contributions from new users and is seeing an increase in adoption, which is very encouraging.

Metadata and Outlinks


One of the main improvements in the new release is the introduction of a Metadata object, which replaces the Map<String,String[]> that was used everywhere in our code as well as the KeyValues utility class which manipulated such Maps. This makes the code a lot simpler and more elegant.
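As a quick illustration of the new API (a minimal sketch; the method names are from memory and worth checking against the Javadoc):

import com.digitalpebble.storm.crawler.Metadata;

public class MetadataExample {
    public static void main(String[] args) {
        Metadata md = new Metadata();
        // a key can hold several values, as the old Map<String, String[]> did
        md.addValue("keyword", "crawler");
        md.addValue("keyword", "storm");
        // convenience accessors replace the old KeyValues helper
        System.out.println(md.getFirstValue("keyword")); // crawler
        System.out.println(md); // prints all key/value pairs
    }
}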

A new MetadataTransfer class has been added to (a) determine what metadata should be kept, e.g. when persisting the information about a URL in a StatusUpdaterBolt, and (b) determine what metadata should be transferred from the source document to its outlinks. This is a very useful feature that gets used quite often in practice.

Speaking of outlinks, they now have a proper class representing them, where the anchor and metadata for a given target URL are kept. Note that the parser bolts populate the metadata using the MetadataTransfer class described above prior to passing them to the ParseFilters. This means that a given ParseFilter can modify the outlinks for a given page or create completely new ones.


JSoupParserBolt

We got a present from our committer Gui, whose company has kindly donated a parsing bolt based on JSoup. This is now the one we use by default; the one based on Tika has been moved to the external part of the code. If you are crawling non-HTML pages then you should be using the Tika-based parser; otherwise the JSoup one is a lot lighter (both in code and dependencies) and works better when extracting data with XPath.

Abstract classes for persistence

We also added many useful resources for writing recursive crawlers, in addition to the status stream that came with the previous release. These can be found in the com.digitalpebble.storm.crawler.persistence package. In particular, we added a new AbstractStatusUpdaterBolt class. As the name suggests, this is meant to be extended to store the tuples coming from the status stream in some sort of storage (e.g. Elasticsearch, SOLR, Cassandra, HBase, etc.). The abstract class keeps an internal cache for newly discovered URLs so that the same URL does not get updated more than once in the backend. Obviously this cache would not outlive the bolt if it died, so it should be seen merely as an optimization and not a 100% reliable filter. The abstract class then calls a Scheduler, which is a pluggable mechanism for defining when a given URL should be fetched next based on its metadata and status. The default scheduler simply relies on the configuration set by the user.
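To give an idea of what an implementation might look like, here is a minimal, hypothetical status updater that just prints to stdout instead of writing to a backend; the signature of store() shown below is an assumption based on my recollection of the API and may differ from the actual abstract class.

import java.util.Date;

import com.digitalpebble.storm.crawler.Metadata;
import com.digitalpebble.storm.crawler.persistence.AbstractStatusUpdaterBolt;
import com.digitalpebble.storm.crawler.persistence.Status;

// Illustrative only: a real implementation would write to Elasticsearch,
// SOLR, HBase, etc. The abstract class has already filtered duplicate
// discoveries via its internal cache and computed nextFetch via the Scheduler.
public class LoggingStatusUpdater extends AbstractStatusUpdaterBolt {
    @Override
    public void store(String url, Status status, Metadata metadata, Date nextFetch) {
        System.out.println(status + "\t" + url + "\t" + nextFetch);
    }
}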

We also added a new AbstractIndexerBolt class, which simplifies writing indexing bolts by allowing users to specify via the configuration which metadata to index.



Elasticsearch

These new classes have been used for our Elasticsearch bolts and spout: we now have a spout pulling URLs from the status index, a status updater bolt and an indexing bolt. These three components allow you to build a recursive crawler with Elasticsearch. We also added an example topology to illustrate how to do this, as well as an init script which defines the schemas of the indices.

As a bonus, we wrote a MetricsConsumer which can be plugged into the Storm metrics mechanism so that the metrics get indexed in Elasticsearch. This would typically be used with Kibana as a way of monitoring the performance of the crawler via the metrics generated by the spouts and bolts, e.g. bytes per second, pages fetched, etc. I had suggested it to the elasticsearch-hadoop community but it hasn't attracted much interest so far.
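For reference, plugging such a consumer into a topology is a one-liner on the Storm Config; the fully qualified class name below is an assumption for the sake of the example.

import backtype.storm.Config;

public class MetricsSetupExample {
    public static Config withESMetrics(Config conf) {
        // register the Elasticsearch-backed metrics consumer
        // (class name assumed for illustration) with a parallelism hint of 1
        conf.registerMetricsConsumer(
            com.digitalpebble.storm.crawler.elasticsearch.metrics.MetricsConsumer.class, 1);
        return conf;
    }
}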

We will probably provide a schema file for Kibana so that users can load a standard dashboard for displaying the metrics. We just need to wait for the next release of Kibana which will contain #1552.

Miscellaneous and next steps

We've replaced the old HTTP protocol implementation we'd borrowed from Nutch with a brand new one based on Apache HttpClient. There is less code to maintain and it is also more robust, particularly on HTTPS pages.


Apart from that, we improved our WIKI pages, upgraded some dependencies (Tika to 1.8, ES to 1.5.1, Storm to 0.9.4), added some resources (e.g. MaxDepthFilter), removed some deprecated ones (#126) and fixed numerous bugs. 



As I said, we've been pretty busy and it looks like this is set to continue with the 0.6 release. It will probably contain #117 as well as resources for Apache SOLR.

Thanks to everyone who contributed to this release in any way.


Friday 8 March 2013

Free your Nutch crawls with pluggable indexers

I have just committed what should be a very important new feature of the next 1.x release of Apache Nutch, namely the possibility of implementing indexing backends via plugins. This is currently on the trunk only but should hopefully be ported to 2.x at some point. The NUTCH-1047 JIRA issue contains a history of patches and discussions for this feature.

As you'll see by reading the explanations below, this is not the same thing as the indexing filters or the storage backends in Nutch 2.x.

Historically, Nutch used to manage its own Lucene indices and provide a web interface for querying them. Support for SOLR was added much later, in the 1.0 release (NUTCH-442), and users had two separate commands: one for indexing directly with Lucene and one for sending the documents to SOLR, in which case the search could be done outside the Nutch search servers and directly with SOLR. We then decided to drop the Nutch search servers and the Lucene-based indexing altogether in Nutch 1.3 (NUTCH-837) and let the SOLR indexer become the only option. This was an excellent move, as it greatly reduced the amount of code we had to look after and meant that we could focus on the crawling while benefiting from the advances in SOLR.

One of the nice things about Nutch is that most of its components are based on plugins. The plugin mechanism was borrowed from Eclipse and allows for extension points and extensions. Nutch has extension points for URLFilters, URLNormalizers, Parsers, Protocols, etc. The full list of Nutch extension points can be found here. Basically, pretty much everything in Nutch is done via plugins, and I have found that most customisations of Nutch I do for my clients are implemented via plugins only.

As you've guessed, NUTCH-1047 is about having generic commands for indexing and handling the backend implementations via plugins. Instead of piggybacking on the SOLR indexer code to send the documents to a different backend, one can now use the brand new generic IndexingJob and isolate the logic of how the documents are sent to the backend in an extension of the new IndexWriter endpoint in a custom plugin.

The IndexWriter interface is pretty straightforward:

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.NutchDocument;

public interface IndexWriter {
  /** Returns a short description of this backend and its parameters. */
  public String describe();
  public void open(JobConf job, String name) throws IOException;
  public void write(NutchDocument doc) throws IOException;
  public void delete(String key) throws IOException;
  public void update(NutchDocument doc) throws IOException;
  public void commit() throws IOException;
  public void close() throws IOException;
}
Having this mechanism allows us to move most of the SOLR-specific code to the new indexer-solr plugin (and hopefully all of it as soon as we have a generic de-duplicator which can use the IndexWriter plugins), but more importantly it will facilitate the implementation of popular indexing backends such as ElasticSearch or Amazon's CloudSearch service without making the core code of Nutch more complex. We frequently get people on the mailing list asking how to store the Nutch documents in such or such a database, and being able to do that in a plugin will definitely make it easier. It will also be a good way of storing Nutch documents as files, etc.
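To give a taste of what a custom backend looks like, here is a deliberately trivial, hypothetical implementation that just prints documents to standard output. A real plugin would also need the usual plugin.xml descriptor to declare the extension, and the actual interface may have a few extra requirements (e.g. Hadoop configuration handling) not shown in the listing above.

import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.NutchDocument;

// Hypothetical example: 'indexes' documents by printing them to stdout.
public class StdOutIndexWriter implements IndexWriter {
    public String describe() {
        return "StdOutIndexWriter: prints NutchDocuments to standard output";
    }
    public void open(JobConf job, String name) throws IOException {}
    public void write(NutchDocument doc) throws IOException {
        System.out.println(doc);
    }
    public void delete(String key) throws IOException {
        System.out.println("delete " + key);
    }
    public void update(NutchDocument doc) throws IOException {
        write(doc);
    }
    public void commit() throws IOException {}
    public void close() throws IOException {}
}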

This is quite a big change to the architecture of Nutch, but we tried to make it as transparent as possible for end users. The only indexer plugin currently available is a port of the existing code for SOLR and it is activated by default. We kept the old solr* commands and modified them so that they use the generic commands with the indexing plugins in the background, so from a user's point of view there should be no difference at all.

There is already a JIRA for a text-based CSV indexing plugin and I expect that the ElasticSearch one will get rapid adoption.

I had been wanting to find the time to work on this for quite a while and I'm very pleased it is now committed, thanks to the comments and reviews I got from my fellow Nutch developers. I look forward to getting more feedback and seeing it being used, extended, improved, etc.