
Thursday, 18 October 2018

What's new in StormCrawler 1.11

I've just released StormCrawler 1.11. Here are the main changes, some of which require modifications to your configuration.

Users should upgrade to this version as it fixes several bugs and adds loads of functionalities.

Dependency upgrades
  • Tika 1.19.1 (#606)
  • Elasticsearch 6.4.1 (#607)
  • SOLR 7.5 (#624)
  • OKHttp 3.11.0
Core

  • /bugfix/ FetcherBolts original metadata overwrites metadata returned by protocol (#636)
  • Override Globally Configured Accepts and Accepts-Language Headers Per-URL  (#634)
  • Support for cookies in okhttp implementation (#632)
  • AbstractHttpProtocol uses StringTabScheme to parse input into URL and Metadata  (#631)
  • Improve MimeType detection for interpreted server-side languages (#630)
  • /bugfix/ Custom intervals in Scheduler can't contain dots  (#616)
  • OKHTTP protocol trust all SSL certificates (#615)
  • HTTPClient protocol setDefaultMaxPerRoute based on max threads per queue (#594)
  • Fetcher Added byteLength to Metadata (#599)
  • URLFilters + ParseFilters refactoring (#593)
  • HTTPClient Add simple basic auth system (#589)
WARC

  • /bugfix/ WARCHdfsBolt writes zero byte files (#596)
SOLR
  • SOLR StatusUpdater use short status name (#627)
  • SOLRSpout log queries, time and number of results (#623)
  • SOLR spout - reuse nextFetchDate (#622)
  • Move reset.fetchdate.after to AbstractQueryingSpout (#628)
  • Abstract functionalities of spout implementations (#617) - see below
SQL
  • MetricsConsumer (#612)
  • Batch PreparedStatements in SQL status updater bolt (#610)
  • SQLSpout group by hostname and get top N results (#609)
  • Harmonise param names for SQL (#619)
  • Move reset.fetchdate.after to AbstractQueryingSpout (#628)
  • Abstract functionalities of spout implementations (#617) - see below

Elasticsearch
  • /bugfix/ NPE in AggregationSpout when there is not any status index created (#597)
  • /bugfix/ NPE in CollapsingSpout (#595)
  • Added ability to implement custom indexes names based on metadata information (#591)
  • StatusMetricsBolt - Added check to avoid NPE when interacting with multi search response (#598)
  • Change default value of es.status.reset.fetchdate.after (#590)
  • Log error if Elasticsearch reports an unexpected problem (#575)
  • ES Wrapper for URLFilters implementing JSONResource (#588)
  • Move reset.fetchdate.after to AbstractQueryingSpout (#628)
  • Abstract functionalities of spout implementations (#617) - see below
As you've probably noticed, #617 affects ES, SOLR as well as SQL. The idea behind it is that the spouts in these modules have a lot in common, as they all query a backend for URLs to fetch. We moved some of the functionality to a brand new class, AbstractQueryingSpout, which greatly reduces the amount of code. The URL caching, the TTL for the purgatory and the minimum delay between queries are now handled in that class. As a result, the spout implementations have less to do and can focus on the specifics of getting the data from their respective backends. A nice side effect is that the SQL and SOLR spouts now benefit from some of the functionality which was until now only available in ES.

You will need to update your configuration to replace the elements which were specific to ES with the generic ones, i.e. spout.reset.fetchdate.after, spout.ttl.purgatory and spout.min.delay.queries. These are also used by SOLR and SQL.
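Assuming a standard YAML crawler configuration, the migration might look like the sketch below. The values are illustrative, and the old ES-specific names for the purgatory TTL and query delay are given from memory, so check them against your own config:

```yaml
# Old, ES-specific settings (remove these):
# es.status.reset.fetchdate.after: -1
# es.status.ttl.purgatory: 30
# es.status.min.delay.queries: 2000

# New, generic settings shared by the ES, SOLR and SQL spouts:
spout.reset.fetchdate.after: -1
spout.ttl.purgatory: 30
spout.min.delay.queries: 2000
```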

Please note that these changes also impact some of the metrics names.

Coming next...

Storm 2.0.0 should be released soon, which is very exciting! There's a branch of StormCrawler which anticipates some of the changes, even though it hasn't been tested much yet. Give it a try if you want to be on the cutting edge!

I expect the SOLR and SQL backends to get further improvements and progressively catch up with our Elasticsearch resources.

Finally, our Bristol workshop next month is now full but there is one in Vilnius on 27/11. I'll also give a talk there the following day. If you are around, come and say hi and get yourself a StormCrawler sticker.

As usual, thanks to all contributors and users. Happy crawling!



Tuesday, 20 March 2018

What's new in StormCrawler 1.8

I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:


Dependency updates
Core
  • Add option to send only N bytes of text to indexers #476
  • BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522
  • MemorySpout to generate tuples with DISCOVERED status #529
  • OKHttp configure type of proxy #530
  • http.content.limit inconsistent default to -1 #534
  • Track time spent in the FetcherBolt queues #535
  • Increase detect.charset.maxlength default value #537
  • FeedParserBolt: metadata added by parse filters not passed forward in topology #541
  • Use UTF-8 for input encoding of seeds (FileSpout) #542
  • Default URL filter: exclude localhost and private address spaces #543
  • URLStreamGrouping returns the taskIDs and not their index #547
WARC
  • Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520
SOLR
  • Schema for status index needs date type for nextFetchDate #544
  • SOLR indexer: use field type text for content field #545
Elasticsearch
  • AggregationSpout fails with default value of es.status.bucket.field == _routing #521
  • Move to Elasticsearch REST API #539
We recommend that all users move to this version as it fixes several bugs (#541, #547) and adds some great new features: in particular the use of the REST API for Elasticsearch, which makes the module future-proof and easier to configure, but also #535 and #543.

As usual, thanks to all contributors and users. Happy crawling!

Thursday, 23 March 2017

What’s new in StormCrawler 1.4

StormCrawler 1.4 has just been released! As usual, all users are advised to upgrade to this version as it fixes some bugs and contains quite a few new functionalities.

Core dependencies upgrades

  • Httpclient 4.5.3
  • Storm 1.0.3 #437

Core module

  • JSoupParser does not dedup outlinks properly, #375
  • Custom schedule based on metadata for non-success pages, #386
  • Adaptive fetch scheduler #407
  • Sitemap: increased default offset for guessing + made it configurable  #409
  • Added URLFilterBolt + use it in ESSeedInjector #421
  • URLStreamGrouping #425
  • Better handling of redirections for HTTP robots #4372d16
  • HTTP Proxy over Basic Authentication #432
  • Improved metrics for status updater cache (hits and misses) #434
  • File protocol implementation #436
  • Added CollectionMetrics (used in ES MetricsConsumer + ES Spout, see below) #7d35acb

AWS

  • Added code for caching and retrieving content from AWS S3 #e16b66ef

SOLR

  • Basic upgrade to Solr 6.4.1
  • Use ConcurrentUpdateSolrClient #183

Elasticsearch

  • Various changes to StatusUpdaterBolt
    Fixed bugs introduced in 1.3 (use of SHA ID), synchronisation issues, better logging, optimisation of docs sent and more robust handling of tuples waiting to be acked (#426). The most important change is a bug fix whereby the cache was never hit (#442) which had a large impact on performance.
  • Simplified README + removed bigjar profile from pom #414
  • Provide basic mapping for doc index #433
  • Simple Grafana dashboard for SC metrics, #380
  • Generate metrics about status counts, #389
  • Spouts report time taken by queries using CollectionMetric, #439 - as illustrated below
Spout query times displayed by Grafana
(illustrating the impact of SamplerAggregationSpout on a large status index)

Coming next?

As usual, it is not clear what the next release will contain but hopefully, we'll switch to Elasticsearch 5 (you can already take it from the branch es5.3) and provide resources for Selenium (see branch jBrowserDriver). As I pointed out in my previous post, getting early feedback on work in progress is a great way of contributing to the project.

We'll probably also upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser. We might move to one of the next releases of Apache Storm, where a recent contribution I made will make it possible to use Elasticsearch 5. Also, some of our StormCrawler code has been donated to Storm, which is great!

In the meantime and as usual, thanks to all contributors and users and happy crawling!

PS: I will be running a workshop in Berlin next month about StormCrawler, Storm in general and Elasticsearch


Wednesday, 4 November 2015

What's new in Storm-Crawler 0.7

Storm-Crawler 0.7 was released yesterday. This release fixes some bugs and provides numerous improvements; we advise users to upgrade to it. Here are the main changes:

  • AbstractIndexingBolt to use status stream in declareOutputFields #190
  • Change Status to ERROR when FETCH_ERROR above threshold #202
  • FetcherBolt tracks cause of error in metadata
  • Add default config file in resources #193
  • FileSpout chokes on very large files #196
  • Use Maven-Shade everywhere #199
  • Ack tick tuples #194
  • Remove PrinterBolt and IndexerBolt, added StdOutStatusUpdater #187
  • Upgraded Tika to 1.11

This release contains many improvements to the Elasticsearch module:


  • Added README with a getting started section
  • IndexerBolt uses url as doc ID
  • ESSpout : maxSecSinceQueriedDate param to avoid deep paging
  • ElasticSearchSpout can random sort -> better diversity of URLs
  • ElasticSearchSpout implements de/activate, counter for time spent querying, configurable result size
  • Simple Kibana dashboards for metrics and status indices
  • Metadata as structured object. Implements #197
  • ES Spout - more metrics acked, failed, es queries and docs
  • ESSeedInjector topology
  • Index init script uses ttl for metrics
  • Upgraded ES version to 1.7.2

The SOLR module has also received some attention:
  • solr-metadata #210
  • Cleaning some documentation and typo issues
  • Remove outdated configuration options for solr module
We also improved the metrics by adding a PerSecondReducer (#209), which is used by the FetcherBolts to provide page and byte per second metrics. The metrics names and codes were also improved - notably the gauges for ESSpout and FetcherBolt.

These changes, combined with the Kibana dashboard templates, make it easy to monitor a crawl and get additional insights into its behaviour, as illustrated below.



Of course thanks to Storm's pluggable and versatile metrics mechanism, it is relatively easy to send metrics to other backends such as AWS Cloudwatch for instance.

Thanks to the various users and contributors who helped with this release.

Friday, 4 September 2015

What's new in Storm-Crawler 0.6

We have just released version 0.6 of Storm-Crawler, an open source web crawling SDK based on Apache Storm. Storm-Crawler provides resources for building scalable, low-latency web crawlers and is used in production at various companies.

We have added loads of improvements and bug fixes since our previous release last June, thanks to the efforts of the community. The activity around the project has been very steady and a new committer (Jorge Luis Betancourt) has joined our ranks. We also had contributions from various users, which is great.

Here are the main features of version 0.6.

Dependencies upgrades

  • Storm 0.9.5
  • crawler-commons 0.6
  • Tika 1.10

Code reorganisation

  • Organise external content as separate sub-modules #145
  • Removed external/metrics #160

API changes

  • ParseFilter from interface to abstract class #159
  • Parse can output more than one document #135

New features and resources

  • SimpleFetcherBolt  enforces politeness #181
  • New RobotsURLFilter #178
  • New ContentFilter to restrict text of document to XPath match #150
  • Adding support for using the canonical URL in the IndexerBolts #161
  • Improvement to SitemapParserBolt #143
  • Enforce robots meta instructions #148
  • Expand XPathFilter to accept a list of expressions as an argument #153
  • JSoupParserBolt does a basic check of the content type #151

External resources


The external (non-core) resources have been separated into discrete sub-modules as their number was getting larger. 

SOLR
Our brand new module for Apache SOLR (see #152) is comparable to the existing ElasticSearch equivalent and provides an IndexerBolt, a MetricsConsumer, a SOLRSpout and a StatusUpdaterBolt.

SQL
Not all web crawls require scalable big data solutions. I conducted a survey of Apache Nutch users some time ago which showed that most people used it on a single machine with fewer than a million URLs. These are often people crawling a single website. With that in mind, we added spout and StatusUpdaterBolt implementations which use MySQL as storage for URL status, which is useful for small recursive crawls. See #172 for details.

AWS CloudSearch
There is also a new AWS module containing an IndexerBolt for Amazon CloudSearch (see #174). 



We hope that people find these improvements useful and would like to thank all users and contributors.


Friday, 8 March 2013

Free your Nutch crawls with pluggable indexers

I have just committed what should be a very important new feature of the next 1.x release of Apache Nutch, namely the possibility to implement indexing backends via plugins. This is currently on the trunk only but should hopefully be ported to 2.x at some point. The Nutch-1047 JIRA issue contains a history of patches and discussions for this feature.

As you'll see by reading the explanations below, this is not the same thing as the indexing filters or the storage backends in Nutch 2.x.

Historically, Nutch managed its own Lucene indices and provided a web interface for querying them. Support for SOLR was added much later, in the 1.0 release (NUTCH-442), and users had two separate commands: one for indexing directly with Lucene and one for sending the documents to SOLR, in which case the search could be done outside the Nutch search servers, directly with SOLR. We then decided to drop the Nutch search servers and the Lucene-based indexing altogether in Nutch 1.3 (NUTCH-837) and let the SOLR indexer become the only option. This was an excellent move, as it greatly reduced the amount of code we had to look after and meant that we could focus on the crawling while benefiting from the advances in SOLR.

One of the nice things about Nutch is that most of its components are based on plugins. The plugin mechanism itself was borrowed from Eclipse and provides extension points and extensions. Nutch has extension points for URLFilters, URLNormalizers, Parsers, Protocols, etc.; the full list of Nutch extensions can be found here. Pretty much everything in Nutch is done via plugins, and I find that most customisations of Nutch I do for my clients are usually implemented via plugins only.

As you've guessed, NUTCH-1047 is about having generic commands for indexing and handling the backend implementations via plugins. Instead of piggybacking the SOLR indexer code to send the documents to a different backend, one can now use the brand new generic IndexingJob and isolate the logic of how the documents are sent to the backend via an extension of the new IndexWriter endpoint in a custom plugin.

The IndexWriter interface is pretty straightforward:

public interface IndexWriter {
    public String describe();
    public void open(JobConf job, String name) throws IOException;
    public void write(NutchDocument doc) throws IOException;
    public void delete(String key) throws IOException;
    public void update(NutchDocument doc) throws IOException;
    public void commit() throws IOException;
    public void close() throws IOException;
}
Having this mechanism allows us to move most of the SOLR-specific code to the new indexer-solr plugin (and hopefully all of it, as soon as we have a generic de-duplicator which can use the IndexWriter plugins). More importantly, it will facilitate the implementation of popular indexing backends such as ElasticSearch or Amazon's CloudSearch service without making the core code of Nutch more complex. We frequently get people on the mailing list asking how to store the Nutch documents in such or such database, and being able to do that in a plugin will definitely make it easier. It will also be a good way of storing Nutch documents as files, etc...
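To make this concrete, here is a rough sketch of what an IndexWriter extension could look like. It is purely hypothetical: the CSVIndexWriter class and the stand-in NutchDocument below are mine, so that the snippet compiles on its own; a real plugin would implement org.apache.nutch.indexer.IndexWriter against the genuine NutchDocument and JobConf classes (open() is omitted here for that reason).

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for org.apache.nutch.indexer.NutchDocument (illustration only).
class NutchDocument {
    private final Map<String, String> fields = new LinkedHashMap<>();
    void add(String name, String value) { fields.put(name, value); }
    Map<String, String> getFields() { return fields; }
}

// Hypothetical IndexWriter extension that dumps documents as CSV lines.
class CSVIndexWriter {
    private final StringBuilder out = new StringBuilder();

    public String describe() { return "CSVIndexWriter - writes documents as CSV lines"; }

    public void write(NutchDocument doc) throws IOException {
        // one CSV line per document, fields in insertion order
        out.append(String.join(",", doc.getFields().values())).append('\n');
    }

    public void delete(String key) throws IOException { /* no-op for a CSV dump */ }
    public void update(NutchDocument doc) throws IOException { write(doc); }
    public void commit() throws IOException { System.out.print(out); }
    public void close() throws IOException { out.setLength(0); }
}

public class IndexWriterSketch {
    public static void main(String[] args) throws IOException {
        CSVIndexWriter writer = new CSVIndexWriter();
        NutchDocument doc = new NutchDocument();
        doc.add("url", "http://example.com/");
        doc.add("title", "Example");
        writer.write(doc);
        writer.commit(); // prints: http://example.com/,Example
        writer.close();
    }
}
```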

This is quite a big change to the architecture of Nutch, but we tried to make it as transparent as possible for end users. The only indexer plugin currently available is a port of the existing code for SOLR, and it is activated by default. We kept the old solr* commands and modified them so that they use the generic commands with the indexing plugins in the background, so from a user's point of view there should be no difference at all.

There is already a JIRA for a text-based CSV indexing plugin and I expect that the ElasticSearch one will get rapid adoption.

I had been willing to find the time to work on this for quite some time and I'm very pleased it is now committed, thanks to the comments and reviews I got from my fellow Nutch developers. I look forward to getting more feedback and seeing it being used, extended, improved, etc... 


Friday, 21 October 2011

Nutch hosting and monitoring

We now provide hosting and monitoring services for Apache Nutch.

For a fixed price, we will set up, run and monitor your Nutch crawler and report on its progress. The cost of the servers is included in the offer and their hardware specs are superior to what you get from Amazon EC2, with no long-term commitment, as the service is billed on a monthly basis.
The price depends on the size of the cluster as well as the complexity of the crawl.


If you use Nutch to feed documents to a search engine, we can also host and monitor your SOLR instances for you!

Tuesday, 22 March 2011

Search for US properties with SOLR and Maptimize

Our clients 5k50 have recently opened a preview of their real-estate search system, which is based on Apache SOLR and Maptimize. Maptimize is a very nice tool which manages the display of data on Google Maps by merging markers which are geographically close together.

We initially audited the existing SOLR setup, then redesigned it to add more functionality and optimise the search speed. The search itself is an interesting mix of map-driven filtering with SOLR queries and faceting. Any changes to the map (click on a cluster, zoom in/out) are reflected in the search results and facets, and vice versa.

Navia is a nice showcase for some of the most commonly used features of SOLR (i.e. faceting, more-like-this, autocompletion) and has a great identity thanks to its mix of geo and text search.  It is currently in beta mode so we can expect a few more improvements over the next few weeks.

And please feel free to give it a try so that we can get plenty of data on the performance :-)

Saturday, 19 March 2011

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise:
  • strong background in NLP and Java
  • GATE, experience of writing plugins and PRs, excellent knowledge of JAPE
  • IE, Linked Data, Ontologies
  • statistical approaches and machine learning
  • large scale computing with Hadoop
  • knowledge of the following technologies / tools: Lucene, SOLR, NoSQL, Tika, UIMA, Mahout
  • good social and presentation skills
  • good spoken and written English, knowledge of other languages would be a plus
  • taste for challenges and problem solving

    DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering.

    More details on our activities can be found on our website. We would consider candidates working remotely with occasional travel to Bristol and our clients in UK and Europe. Being located in or near Bristol would be a plus.

    This job is an opportunity to get involved in the growth of a small company, work on interesting projects and take part in various Apache related projects and events. Bristol is also a great place to live.


   Please send your CV and cover letter before the 15th April 2011 to job@digitalpebble.com


    Best regards,

    Julien Nioche

Thursday, 26 August 2010

Using Payloads with DisMaxQParser in SOLR

Payloads are a good way of controlling the scores in SOLR/Lucene.

This post by Grant Ingersoll gives a good introduction to payloads; I also found http://www.ultramagnus.org/?p=1 pretty useful.

What I will describe here is how to use the payloads and have the functionalities of the DisMaxQParser in SOLR.

SOLR already has a field type for analysing payloads 
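A minimal sketch of such a field type, along the lines of the standard SOLR payloads example, might look like this (the field type name and delimiter are illustrative):

```xml
<!-- schema.xml: sketch of a payload-aware field type -->
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- a token such as "solr|2.5" is indexed as "solr" with a float payload of 2.5 -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float" delimiter="|"/>
  </analyzer>
</fieldtype>
```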
and we can also define a custom Similarity to use with the payloads
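Such a Similarity typically overrides scorePayload to turn the payload bytes into a score factor. A payload produced by the float encoder is simply a 4-byte big-endian IEEE-754 float, so the decoding step boils down to the sketch below (the helper names are mine, and the code is shown standalone rather than inside a Lucene Similarity subclass):

```java
public class PayloadDecodeSketch {

    // Encode a float boost the way the float payload encoder does:
    // 4 bytes, big-endian IEEE-754.
    static byte[] encodeFloat(float f) {
        int bits = Float.floatToIntBits(f);
        return new byte[] {
            (byte) (bits >> 24), (byte) (bits >> 16),
            (byte) (bits >> 8),  (byte) bits
        };
    }

    // Decode it back; this is essentially what a custom Similarity would do
    // in scorePayload() before multiplying the score by the result.
    static float decodeFloat(byte[] payload, int offset) {
        int bits = ((payload[offset] & 0xFF) << 24)
                 | ((payload[offset + 1] & 0xFF) << 16)
                 | ((payload[offset + 2] & 0xFF) << 8)
                 |  (payload[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float boost = 2.5f;
        System.out.println(decodeFloat(encodeFloat(boost), 0)); // prints 2.5
    }
}
```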
then specify this in the SOLR schema

<!-- schema.xml -->
<similarity class="uk.org.company.solr.PayloadSimilarity" />


 
So far so good. We now need a QueryParser plugin in order to use the payloads in the search and, as mentioned above, I want to keep the functionalities of the DisMaxQueryParser.
The problem is that we need to build PayloadTermQuery objects instead of TermQueries, which happens deep down in the object hierarchy and cannot, AFAIK, be changed simply from DisMaxQueryParser.
I have implemented a modified version of DisMaxQueryParser which rewrites the main part of the query (a.k.a. userQuery in the implementation) and substitutes the TermQueries with PayloadTermQueries.
First we'll create a QParserPlugin 
which does not do much but simply exposes the PLDisMaxQueryParser, a modified version of the standard DisMaxQueryParser using PayloadQuery objects.

Once these 3 classes have been compiled, jarred and put in the classpath of SOLR, we must register the query parser in solrconfig.xml with something along these lines (the class name is illustrative):

<queryParser name="payload" class="uk.org.company.solr.PayloadQParserPlugin" />
 
then specify for the requestHandler:
 
<str name="defType">payload</str>
 
<!-- plf : comma separated list of field names --> 
 <str name="plf">
  payloads
 </str>
 
The fields listed in the parameter plf will be queried with Payload query objects.  Remember that you can use &debugQuery=true to get the details of the scores and check that the payloads are being used.
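Putting the two request-handler settings above together, the relevant section of solrconfig.xml might look like the following sketch; the handler name and the qf field list are illustrative, while defType and plf are the parameters described above:

```xml
<!-- solrconfig.xml: sketch of a request handler using the payload parser -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">payload</str>
    <!-- plf: comma separated list of field names queried with Payload queries -->
    <str name="plf">payloads</str>
    <str name="qf">title content payloads</str>
  </lst>
</requestHandler>
```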