
Thursday, 18 October 2018

What's new in StormCrawler 1.11

I've just released StormCrawler 1.11. Here are the main changes, some of which require modifications to your configuration.

Users should upgrade to this version as it fixes several bugs and adds loads of functionalities.

Dependency upgrades
  • Tika 1.19.1 (#606)
  • Elasticsearch 6.4.1 (#607)
  • SOLR 7.5 (#624)
  • OKHttp 3.11.0
Core

  • /bugfix/ FetcherBolts original metadata overwrites metadata returned by protocol (#636)
  • Override Globally Configured Accepts and Accepts-Language Headers Per-URL  (#634)
  • Support for cookies in okhttp implementation (#632)
  • AbstractHttpProtocol uses StringTabScheme to parse input into URL and Metadata  (#631)
  • Improve MimeType detection for interpreted server-side languages (#630)
  • /bugfix/ Custom intervals in Scheduler can't contain dots  (#616)
  • OKHTTP protocol trust all SSL certificates (#615)
  • HTTPClient protocol setDefaultMaxPerRoute based on max threads per queue (#594)
  • Fetcher Added byteLength to Metadata (#599)
  • URLFilters + ParseFilters refactoring (#593)
  • HTTPClient Add simple basic auth system (#589)
WARC

  • /bugfix/ WARCHdfsBolt writes zero byte files (#596)
SOLR
  • SOLR StatusUpdater use short status name (#627)
  • SOLRSpout log queries, time and number of results (#623)
  • SOLR spout - reuse nextFetchDate (#622)
  • Move reset.fetchdate.after to AbstractQueryingSpout (#628)
  • Abstract functionalities of spout implementations (#617) - see below
SQL
  • MetricsConsumer (#612)
  • Batch PreparedStatements in SQL status updater bolt (#610)
  • SQLSpout group by hostname and get top N results (#609)
  • Harmonise param names for SQL (#619)
  • Move reset.fetchdate.after to AbstractQueryingSpout (#628)
  • Abstract functionalities of spout implementations (#617) - see below

Elasticsearch
  • /bugfix/ NPE in AggregationSpout when there is not any status index created (#597)
  • /bugfix/ NPE in CollapsingSpout (#595)
  • Added ability to implement custom indexes names based on metadata information (#591)
  • StatusMetricsBolt - Added check to avoid NPE when interacting with multi search response (#598)
  • Change default value of es.status.reset.fetchdate.after (#590)
  • Log error if Elasticsearch reports an unexpected problem (#575)
  • ES Wrapper for URLFilters implementing JSONResource (#588)
  • Move reset.fetchdate.after to AbstractQueryingSpout (#628)
  • Abstract functionalities of spout implementations (#617) - see below
As you've probably noticed, #617 affects ES, SOLR as well as SQL. The idea behind it is that the spouts in these modules have a lot in common, as they all query a backend for URLs to fetch. We moved some of the functionality to a brand new class, AbstractQueryingSpout, which greatly reduces the amount of code. The URL caching, the TTL for the purgatory and the minimum delay between queries are now handled in that class. As a result, the spout implementations have less to do and can focus on the specifics of getting the data from their respective backends. A nice side effect is that the SQL and SOLR spouts now benefit from some of the functionality which was until now only available in ES.

You will need to update your configuration to replace the elements which were specific to ES with the generic ones, i.e. spout.reset.fetchdate.after, spout.ttl.purgatory and spout.min.delay.queries. These are also used by SOLR and SQL.
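Assuming a standard YAML crawler configuration, the migration might look like the sketch below. The values are illustrative, and the old ES-specific names for the purgatory TTL and query delay are given from memory, so check them against your own config:

```yaml
# Old, ES-specific settings (remove these):
# es.status.reset.fetchdate.after: -1
# es.status.ttl.purgatory: 30
# es.status.min.delay.queries: 2000

# New, generic settings shared by the ES, SOLR and SQL spouts:
spout.reset.fetchdate.after: -1
spout.ttl.purgatory: 30
spout.min.delay.queries: 2000
```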

Please note that these changes also impact some of the metrics names.

Coming next...

Storm 2.0.0 should be released soon, which is very exciting! There's a branch of StormCrawler which anticipates some of the changes, even though it hasn't been tested much yet. Give it a try if you want to be on the cutting edge!

I expect the SOLR and SQL backends to get further improvements and progressively catch up with our Elasticsearch resources.

Finally, our Bristol workshop next month is now full but there is one in Vilnius on 27/11. I'll also give a talk there the following day. If you are around, come and say hi and get yourself a StormCrawler sticker.

As usual, thanks to all contributors and users. Happy crawling!



Tuesday, 20 March 2018

What's new in StormCrawler 1.8

I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:


Dependency updates
Core
  • Add option to send only N bytes of text to indexers #476
  • BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522
  • MemorySpout to generate tuples with DISCOVERED status #529
  • OKHttp configure type of proxy #530
  • http.content.limit inconsistent default to -1 #534
  • Track time spent in the FetcherBolt queues #535
  • Increase detect.charset.maxlength default value #537
  • FeedParserBolt: metadata added by parse filters not passed forward in topology #541
  • Use UTF-8 for input encoding of seeds (FileSpout) #542
  • Default URL filter: exclude localhost and private address spaces #543
  • URLStreamGrouping returns the taskIDs and not their index #547
WARC
  • Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520
SOLR
  • Schema for status index needs date type for nextFetchDate #544
  • SOLR indexer: use field type text for content field #545
Elasticsearch
  • AggregationSpout fails with default value of es.status.bucket.field == _routing #521
  • Move to Elasticsearch REST API #539
We recommend that all users move to this version as it fixes several bugs (#541, #547) and adds some great new features: in particular the use of the REST API for Elasticsearch, which makes the module future-proof and easier to configure, but also #535 and #543.

As usual, thanks to all contributors and users. Happy crawling!

Thursday, 23 March 2017

What’s new in StormCrawler 1.4

StormCrawler 1.4 has just been released! As usual, all users are advised to upgrade to this version as it fixes some bugs and contains quite a few new functionalities.

Core dependencies upgrades

  • Httpclient 4.5.3
  • Storm 1.0.3 #437

Core module

  • JSoupParser does not dedup outlinks properly, #375
  • Custom schedule based on metadata for non-success pages, #386
  • Adaptive fetch scheduler #407
  • Sitemap: increased default offset for guessing + made it configurable  #409
  • Added URLFilterBolt + use it in ESSeedInjector #421
  • URLStreamGrouping #425
  • Better handling of redirections for HTTP robots #4372d16
  • HTTP Proxy over Basic Authentication #432
  • Improved metrics for status updater cache (hits and misses) #434
  • File protocol implementation #436
  • Added CollectionMetrics (used in ES MetricsConsumer + ES Spout, see below) #7d35acb

AWS

  • Added code for caching and retrieving content from AWS S3 #e16b66ef

SOLR

  • Basic upgrade to Solr 6.4.1
  • Use ConcurrentUpdateSolrClient #183

Elasticsearch

  • Various changes to StatusUpdaterBolt
    Fixed bugs introduced in 1.3 (use of SHA ID), synchronisation issues, better logging, optimisation of docs sent and more robust handling of tuples waiting to be acked (#426). The most important change is a bug fix whereby the cache was never hit (#442) which had a large impact on performance.
  • Simplified README + removed bigjar profile from pom #414
  • Provide basic mapping for doc index #433
  • Simple Grafana dashboard for SC metrics, #380
  • Generate metrics about status counts, #389
  • Spouts report time taken by queries using CollectionMetric, #439 - as illustrated below
Spout query times displayed by Grafana
(illustrating the impact of SamplerAggregationSpout on a large status index)

Coming next?

As usual, it is not clear what the next release will contain but hopefully, we'll switch to Elasticsearch 5 (you can already take it from the branch es5.3) and provide resources for Selenium (see branch jBrowserDriver). As I pointed out in my previous post, getting early feedback on work in progress is a great way of contributing to the project.

We'll probably also upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser. We might move to one of the next releases of Apache Storm, where a recent contribution I made will make it possible to use Elasticsearch 5. Also, some of our StormCrawler code has been donated to Storm, which is great!

In the meantime and as usual, thanks to all contributors and users and happy crawling!

PS: I will be running a workshop in Berlin next month about StormCrawler, Storm in general and Elasticsearch


Wednesday, 4 November 2015

What's new in Storm-Crawler 0.7

Storm-Crawler 0.7 was released yesterday. This release fixes some bugs and provides numerous improvements; we advise users to upgrade to it. Here are the main changes:

  • AbstractIndexingBolt to use status stream in declareOutputFields #190
  • Change Status to ERROR when FETCH_ERROR above threshold #202
  • FetcherBolt tracks cause of error in metadata
  • Add default config file in resources #193
  • FileSpout chokes on very large files #196
  • Use Maven-Shade everywhere #199
  • Ack tick tuples #194
  • Remove PrinterBolt and IndexerBolt, added StdOutStatusUpdater #187
  • Upgraded Tika to 1.11

This release contains many improvements to the Elasticsearch module:


  • Added README with a getting started section
  • IndexerBolt uses url as doc ID
  • ESSpout : maxSecSinceQueriedDate param to avoid deep paging
  • ElasticSearchSpout can random sort -> better diversity of URLs
  • ElasticSearchSpout implements de/activate, counter for time spent querying, configurable result size
  • Simple Kibana dashboards for metrics and status indices
  • Metadata as structured object. Implements #197
  • ES Spout - more metrics acked, failed, es queries and docs
  • ESSeedInjector topology
  • Index init script uses ttl for metrics
  • Upgraded ES version to 1.7.2

The SOLR module has also received some attention:
  • solr-metadata #210
  • Cleaning some documentation and typo issues
  • Remove outdated configuration options for solr module
We also improved the metrics by adding a PerSecondReducer (#209), which is used by the FetcherBolts to provide page and byte per second metrics. The metrics names and codes were also improved - notably the gauges for ESSpout and FetcherBolt.

These changes, combined with the Kibana dashboard templates, make it easy to monitor a crawl and get additional insights into its behaviour, as illustrated below.



Of course thanks to Storm's pluggable and versatile metrics mechanism, it is relatively easy to send metrics to other backends such as AWS Cloudwatch for instance.

Thanks to the various users and contributors who helped with this release.

Friday, 4 September 2015

What's new in Storm-Crawler 0.6

We have just released version 0.6 of Storm-Crawler, an open source web crawling SDK based on Apache Storm. Storm-Crawler provides resources for building scalable, low-latency web crawlers and is used in production at various companies.

We have added loads of improvements and bug fixes since our previous release last June, thanks to the efforts of the community. The activity around the project has been very steady and a new committer (Jorge Luis Betancourt) has joined our ranks. We also had contributions from various users, which is great.

Here are the main features of version 0.6.

Dependencies upgrades

  • Storm 0.9.5
  • crawler-commons 0.6
  • Tika 1.10

Code reorganisation

  • Organise external content as separate sub-modules #145
  • Removed external/metrics #160

API changes

  • ParseFilter from interface to abstract class #159
  • Parse can output more than one document #135

New features and resources

  • SimpleFetcherBolt  enforces politeness #181
  • New RobotsURLFilter #178
  • New ContentFilter to restrict text of document to XPath match #150
  • Adding support for using the canonical URL in the IndexerBolts #161
  • Improvement to SitemapParserBolt #143
  • Enforce robots meta instructions #148
  • Expand XPathFilter to accept a list of expressions as an argument #153
  • JSoupParserBolt does a basic check of the content type #151

External resources


The external (non-core) resources have been separated into discrete sub-modules as their number was getting larger. 

SOLR
Our brand new module for Apache SOLR (see #152) is comparable to the existing ElasticSearch equivalent and provides an IndexerBolt, a MetricsConsumer, a SOLRSpout and a StatusUpdaterBolt.

SQL
Not all web crawls require scalable big data solutions. I conducted a survey of Apache Nutch users some time ago which showed that most people used it on a single machine with fewer than a million URLs. These are often people crawling a single website. With that in mind, we added spout and StatusUpdaterBolt implementations which use MySQL as storage for URL status, which is useful for small recursive crawls. See #172 for details.

AWS CloudSearch
There is also a new AWS module containing an IndexerBolt for Amazon CloudSearch (see #174). 



We hope that people find these improvements useful and would like to thank all users and contributors.


Friday, 8 March 2013

Free your Nutch crawls with pluggable indexers

I have just committed what should be a very important new feature of the next 1.x release of Apache Nutch, namely the possibility to implement indexing backends via plugins. This is currently on the trunk only but should hopefully be ported to 2.x at some point. The Nutch-1047 JIRA issue contains a history of patches and discussions for this feature.

As you'll see by reading the explanations below, this is not the same thing as the indexing filters or the storage backends in Nutch 2.x.

Historically, Nutch managed its own Lucene indices and provided a web interface for querying them. Support for SOLR was added much later, in the 1.0 release (NUTCH-442), and users had two separate commands: one for indexing directly with Lucene and one for sending the documents to SOLR, in which case the search could be done outside the Nutch search servers, directly with SOLR. We then decided to drop the Nutch search servers and the Lucene-based indexing altogether in Nutch 1.3 (NUTCH-837) and let the SOLR indexer become the only option. This was an excellent move, as it greatly reduced the amount of code we had to look after and meant that we could focus on the crawling while benefiting from the advances in SOLR.

One of the nice things about Nutch is that most of its components are based on plugins. The plugin mechanism itself was borrowed from Eclipse and provides extension points and extensions. Nutch has extension points for URLFilters, URLNormalizers, Parsers, Protocols, etc.; the full list of Nutch extensions can be found here. Pretty much everything in Nutch is done via plugins, and I find that most customisations of Nutch I do for my clients are usually implemented via plugins only.

As you've guessed, NUTCH-1047 is about having generic commands for indexing and handling the backend implementations via plugins. Instead of piggybacking the SOLR indexer code to send the documents to a different backend, one can now use the brand new generic IndexingJob and isolate the logic of how the documents are sent to the backend via an extension of the new IndexWriter endpoint in a custom plugin.

The IndexWriter interface is pretty straightforward:

public interface IndexWriter {
    public String describe();
    public void open(JobConf job, String name) throws IOException;
    public void write(NutchDocument doc) throws IOException;
    public void delete(String key) throws IOException;
    public void update(NutchDocument doc) throws IOException;
    public void commit() throws IOException;
    public void close() throws IOException;
}
Having this mechanism allows us to move most of the SOLR-specific code to the new indexer-solr plugin (and hopefully all of it, as soon as we have a generic de-duplicator which can use the IndexWriter plugins). More importantly, it will facilitate the implementation of popular indexing backends such as ElasticSearch or Amazon's CloudSearch service without making the core code of Nutch more complex. We frequently get people on the mailing list asking how to store the Nutch documents in such or such database, and being able to do that in a plugin will definitely make it easier. It will also be a good way of storing Nutch documents as files, etc...
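To make this concrete, here is a rough sketch of what an IndexWriter extension could look like. It is purely hypothetical: the CSVIndexWriter class and the stand-in NutchDocument below are mine, so that the snippet compiles on its own; a real plugin would implement org.apache.nutch.indexer.IndexWriter against the genuine NutchDocument and JobConf classes (open() is omitted here for that reason).

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for org.apache.nutch.indexer.NutchDocument (illustration only).
class NutchDocument {
    private final Map<String, String> fields = new LinkedHashMap<>();
    void add(String name, String value) { fields.put(name, value); }
    Map<String, String> getFields() { return fields; }
}

// Hypothetical IndexWriter extension that dumps documents as CSV lines.
class CSVIndexWriter {
    private final StringBuilder out = new StringBuilder();

    public String describe() { return "CSVIndexWriter - writes documents as CSV lines"; }

    public void write(NutchDocument doc) throws IOException {
        // one CSV line per document, fields in insertion order
        out.append(String.join(",", doc.getFields().values())).append('\n');
    }

    public void delete(String key) throws IOException { /* no-op for a CSV dump */ }
    public void update(NutchDocument doc) throws IOException { write(doc); }
    public void commit() throws IOException { System.out.print(out); }
    public void close() throws IOException { out.setLength(0); }
}

public class IndexWriterSketch {
    public static void main(String[] args) throws IOException {
        CSVIndexWriter writer = new CSVIndexWriter();
        NutchDocument doc = new NutchDocument();
        doc.add("url", "http://example.com/");
        doc.add("title", "Example");
        writer.write(doc);
        writer.commit(); // prints: http://example.com/,Example
        writer.close();
    }
}
```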

This is quite a big change to the architecture of Nutch, but we tried to make it as transparent as possible for end users. The only indexer plugin currently available is a port of the existing code for SOLR, and it is activated by default. We kept the old solr* commands and modified them so that they use the generic commands with the indexing plugins in the background, so from a user's point of view there should be no difference at all.

There is already a JIRA for a text-based CSV indexing plugin and I expect that the ElasticSearch one will get rapid adoption.

I had been willing to find the time to work on this for quite some time and I'm very pleased it is now committed, thanks to the comments and reviews I got from my fellow Nutch developers. I look forward to getting more feedback and seeing it being used, extended, improved, etc... 


Friday, 21 October 2011

Nutch hosting and monitoring

We now provide hosting and monitoring services for Apache Nutch.

For a fixed price, we will set up, run and monitor your Nutch crawler and report on its progress. The cost of the servers is included in the offer and their hardware specs are superior to what you get from Amazon EC2, with no long-term commitment, as the service is billed on a monthly basis.
The price depends on the size of the cluster as well as the complexity of the crawl.


If you use Nutch to feed documents to a search engine, we can also host and monitor your SOLR instances for you!

Tuesday, 22 March 2011

Search for US properties with SOLR and Maptimize

Our clients 5k50 have recently opened a preview of their real-estate search system, which is based on Apache SOLR and Maptimize. Maptimize is a very nice tool which manages the display of data on Google Maps by merging markers which are geographically close together.

We initially audited the existing SOLR setup, then redesigned it to add more functionality and optimise the search speed. The search itself is an interesting mix of map-driven filtering with SOLR queries and faceting. Any changes to the map (click on a cluster, zoom in/out) are reflected in the search results and facets, and vice versa.

Navia is a nice showcase for some of the most commonly used features of SOLR (i.e. faceting, more-like-this, autocompletion) and has a great identity thanks to its mix of geo and text search.  It is currently in beta mode so we can expect a few more improvements over the next few weeks.

And please feel free to give it a try so that we can get plenty of data on the performance :-)

Saturday, 19 March 2011

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise:
  • strong background in NLP and Java
  • GATE, experience of writing plugins and PRs, excellent knowledge of JAPE
  • IE, Linked Data, Ontologies
  • statistical approaches and machine learning
  • large scale computing with Hadoop
  • knowledge of the following technologies / tools: Lucene, SOLR, NoSQL, Tika, UIMA, Mahout
  • good social and presentation skills
  • good spoken and written English, knowledge of other languages would be a plus
  • taste for challenges and problem solving

    DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering.

    More details on our activities can be found on our website. We would consider candidates working remotely with occasional travel to Bristol and our clients in UK and Europe. Being located in or near Bristol would be a plus.

    This job is an opportunity to get involved in the growth of a small company, work on interesting projects and take part in various Apache related projects and events. Bristol is also a great place to live.


   Please send your CV and cover letter before the 15th April 2011 to job@digitalpebble.com


    Best regards,

    Julien Nioche

Thursday, 26 August 2010

Using Payloads with DisMaxQParser in SOLR

Payloads are a good way of controlling the scores in SOLR/Lucene.

This post by Grant Ingersoll gives a good introduction to payloads; I also found http://www.ultramagnus.org/?p=1 pretty useful.

What I will describe here is how to use the payloads and have the functionalities of the DisMaxQParser in SOLR.

SOLR already has a field type for analysing payloads 
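A minimal sketch of such a field type, along the lines of the standard SOLR payloads example, might look like this (the field type name and delimiter are illustrative):

```xml
<!-- schema.xml: sketch of a payload-aware field type -->
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- a token such as "solr|2.5" is indexed as "solr" with a float payload of 2.5 -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float" delimiter="|"/>
  </analyzer>
</fieldtype>
```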
and we can also define a custom Similarity to use with the payloads
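Such a Similarity typically overrides scorePayload to turn the payload bytes into a score factor. A payload produced by the float encoder is simply a 4-byte big-endian IEEE-754 float, so the decoding step boils down to the sketch below (the helper names are mine, and the code is shown standalone rather than inside a Lucene Similarity subclass):

```java
public class PayloadDecodeSketch {

    // Encode a float boost the way the float payload encoder does:
    // 4 bytes, big-endian IEEE-754.
    static byte[] encodeFloat(float f) {
        int bits = Float.floatToIntBits(f);
        return new byte[] {
            (byte) (bits >> 24), (byte) (bits >> 16),
            (byte) (bits >> 8),  (byte) bits
        };
    }

    // Decode it back; this is essentially what a custom Similarity would do
    // in scorePayload() before multiplying the score by the result.
    static float decodeFloat(byte[] payload, int offset) {
        int bits = ((payload[offset] & 0xFF) << 24)
                 | ((payload[offset + 1] & 0xFF) << 16)
                 | ((payload[offset + 2] & 0xFF) << 8)
                 |  (payload[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float boost = 2.5f;
        System.out.println(decodeFloat(encodeFloat(boost), 0)); // prints 2.5
    }
}
```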
then specify this in the SOLR schema

<!-- schema.xml -->
<similarity class="uk.org.company.solr.PayloadSimilarity" />


 
So far so good. We now need a QueryParser plugin in order to use the payloads in the search and, as mentioned above, I want to keep the functionalities of the DisMaxQueryParser.
The problem is that we need to build PayloadTermQuery objects instead of TermQueries, which happens deep down in the object hierarchy and cannot, AFAIK, be changed simply from DisMaxQueryParser.
I have implemented a modified version of DisMaxQueryParser which rewrites the main part of the query (a.k.a. userQuery in the implementation) and substitutes the TermQueries with PayloadTermQueries.
First we'll create a QParserPlugin 
which does not do much but simply exposes the PLDisMaxQueryParser, a modified version of the standard DisMaxQueryParser using PayloadQuery objects.

Once these 3 classes have been compiled, jarred and put in the classpath of SOLR, we must register the query parser in solrconfig.xml with something along these lines (the class name is illustrative):

<queryParser name="payload" class="uk.org.company.solr.PayloadQParserPlugin" />
 
then specify for the requestHandler:
 
<str name="defType">payload</str>
 
<!-- plf : comma separated list of field names --> 
 <str name="plf">
  payloads
 </str>
 
The fields listed in the parameter plf will be queried with Payload query objects.  Remember that you can use &debugQuery=true to get the details of the scores and check that the payloads are being used.
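Putting the two request-handler settings above together, the relevant section of solrconfig.xml might look like the following sketch; the handler name and the qf field list are illustrative, while defType and plf are the parameters described above:

```xml
<!-- solrconfig.xml: sketch of a request handler using the payload parser -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">payload</str>
    <!-- plf: comma separated list of field names queried with Payload queries -->
    <str name="plf">payloads</str>
    <str name="qf">title content payloads</str>
  </lst>
</requestHandler>
```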