DigitalPebble's Blog: storm

Thursday, 23 November 2023

Meet the StormCrawler users: Q&A with the OpenWebSearch.eu project

It has been a while since our first “Meet the StormCrawler users” blog. Since StormCrawler is still going strong and is used by a wide variety of users, we are delighted to put the spotlight on one of the most exciting projects that use it. Our guests today are Michael Dinzinger and Saber Zerhoudi, both from the University of Passau in Germany.

Can you please introduce yourselves and the project you are working on?

Hello, we are Saber and Michael, both PhD students in Passau. Since September 2022, we have been working on OpenWebSearch.eu, a European research project in which people from what are now more than 15 participating institutes collaborate on building an Open Web Index.

Our task here at Uni Passau is collaborative and resource-efficient crawling, which is the first technical step in building the Index (see figure below). The end results are metadata and index files, currently in Parquet and CIFF format. These are hosted on the project partners’ shared infrastructure and will soon be available for download.

Figure 1: Open Web Search pipeline

By providing these files to our users, we want to empower them to work on new search applications and tap the web as a resource for their research and business ideas. The Open Web Index is in this sense a truly open, transparent and legally compliant alternative to the proprietary Web Indices of the big tech gatekeepers.

How do you use StormCrawler and URLFrontier?

We use StormCrawler to build our own crawling pipelines by configuring - and in some cases extending - the already existing software components. We particularly appreciate its high customizability, because we use the framework for classic discovery crawling, which we need to feed the Open Web Index, and also for more task-specific and research-oriented crawling.

A major challenge in our work is the heterogeneous infrastructure on top of which we are building the crawling system. The different infrastructure partners in the project provide a large set of commodity hardware, which is hosted in different datacenters dispersed over Europe. Despite the geographic distribution of the machines, all nodes should collaborate on the same shared crawl. For that purpose, we deploy URLFrontier in a central computing site. The Frontier services distribute the crawl space and communicate with the remote crawlers in order to provide them with a continuous flow of URLs to be fetched (see figure below). URLFrontier can use different backends to store the data; we chose one that builds on another open source project, OpenSearch.
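
The interview does not describe how the crawl space is actually divided between the remote crawlers. As a rough illustration only, a common approach is to key the assignment on the hostname, so that every URL of a given site ends up with the same crawler. The class name, the crawler names and the hashing rule below are assumptions made for this sketch and are not taken from URLFrontier or OWLer.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: assigns URLs to crawlers by hostname. Not URLFrontier code. */
final class CrawlSpacePartitionSketch {

    /** Returns the crawler responsible for the host of the given URL. */
    static String assign(String url, List<String> crawlers) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        // Hostnames are case-insensitive, so normalise before hashing.
        int hash = host.toLowerCase(Locale.ROOT).hashCode();
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(hash, crawlers.size());
        return crawlers.get(index);
    }

    public static void main(String[] args) {
        // Made-up crawler names, for illustration only.
        List<String> crawlers = List.of("crawler-a", "crawler-b", "crawler-c");
        System.out.println(assign("https://example.org/page1", crawlers));
        System.out.println(assign("https://example.org/page2", crawlers));
    }
}
```

Both calls in `main` print the same crawler name because the two URLs share a host. Keeping a host on a single crawler is what makes per-site politeness rules easy to enforce in a distributed setup.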

What results did you get so far?

The crawling is currently still in its experimental phase, but fortunately, we have already achieved some interesting and promising numbers. For example, we are running three StormCrawler instances at the moment. These have fetched over 200M web pages within a single week and each of them produced between 200 and 250 GiB of WARC files per day. The crawled data is filtered and enriched with meta information, before it is provided as index and metadata files to the public. In the next steps, we want to upscale the crawling to several terabytes and improve the prioritisation of crawl URLs to get a strong focus on high-quality pages.

It is definitely worth mentioning that the WARC module of StormCrawler helped us a lot. In order to get our indexing pipeline going, we started with copying WARC files from CommonCrawl, before we were able to crawl on our own.

Why did you choose StormCrawler?

We chose StormCrawler primarily for its compatibility with URLFrontier. This synergy made it an excellent starting point for developing a large-scale, coordinated, and distributed crawling cluster. Additionally, the open-source nature of the project and its active community influenced our decision. It was crucial for us to be supported by a network of developers who continuously enhance the core software and provide assistance or solutions when needed.

Did you make any contributions to it? Any advice you could give to future users and contributors?

Yes, we have contributed to StormCrawler by creating a forked version named OWLer.

This version includes several improvements and additions we deemed necessary for our project. We've implemented extended topologies for various purposes and added a classification component to categorise and annotate URLs based on either just the URL or the URL plus website content. It serves as a labelling tool for the crawler's content.

URLFrontier has also been expanded to accommodate these modifications, enabling crawlers to specialise in topics, languages, genres, etc.

Moreover, we have introduced a "Crawling-On-Demand" service. Users can register their requests on the new OWLer webpage by specifying a list of seed URLs and additional information. Upon submission, a StormCrawler instance is deployed in our infrastructure, fetching and storing the content as WARC files in a dedicated S3 bucket. Once completed, users receive a link to download the WARC files via email. URLFrontier tracks the progress of these crawls.

What's next?

We are currently expanding our "Crawling-On-Demand" service to include "Indexing-On-Demand." Users will be able to specify a list of seed URLs and additional tags. We will then search our database of previously crawled and processed URLs for recent content matching this list and provide it to the user in an indexed format.

LinkedIn: openwebsearch-eu
X: OpenWebSearchEU
Mastodon: @openwebsearcheu@suma-ev.social

Tuesday, 11 January 2022

What's new in StormCrawler 2.2

StormCrawler 2.2 has just been released. From now on, releases will be made for 2.x only: 1.18 was the last release of the 1.x branch, which is now discontinued. In case you were wondering why there was no "What's new in StormCrawler 2.1", it is simply that it contained the same modifications as 1.18 and did not get its own announcement.

This version contains many bugfixes. As usual, users are advised to upgrade to this version.

Happy crawling and thanks to our sponsors, contributors and users!

PS: I am tempted to run a workshop on web crawling with StormCrawler at the BigData conference in Vilnius in November. Anyone interested? If so, please get in touch and let me know what you'd like to learn about. https://bigdataconference.eu/

Wednesday, 5 May 2021

What's new in StormCrawler 1.18

StormCrawler 1.18 has just been released. Since the previous version dates from nearly 10 months ago, the number of changes is rather large (see below).

This version contains many bugfixes. As usual, users are advised to upgrade to this version. One of the noticeable new features is a module for URLFrontier (if you haven't checked it out, do so right now!); I will publish a tutorial on how to use it soon.

1.18 is also likely to be the last release based on Apache Storm 1.x; our 2.x branch will become master as soon as I have released 2.1.

Happy crawling and thanks to our sponsors, contributors and users!

Monday, 20 July 2020

Please welcome StormCrawler 2.0

Nearly 6 years after its initial release and after another 32 releases, StormCrawler has just reached version 2.0!

This is similar to what we did 4 years ago when 1.0 was released, in that the change of major version reflects the version of Apache Storm that StormCrawler is based on. This is not a major refactoring of StormCrawler in any way, although some minor changes can be found, mainly in the way the topologies are submitted. These changes are documented in the READMEs generated by our archetypes.

In terms of functionality and behavior, StormCrawler 2.0 is similar to version 1.17, released a few minutes ago.

I expect to keep both branches in parallel for a bit, at least until StormCrawler 2.0 has been sufficiently tested and is used by the majority of our users.

The change to Apache Storm 2 is not just a way of future-proofing StormCrawler, since version 2 is the current branch in Apache Storm. By adopting Storm 2, we are also getting a platform that is 100% Java, which makes debugging and possible contributions to Apache Storm itself easier, and we also benefit from Storm's recent improvements such as improved performance and a better backpressure model.

I am looking forward to getting feedback (and bugfixes) from the StormCrawler community. Please give StormCrawler 2.0 a try if you can.

Happy crawling!

What's new in StormCrawler 1.17

I have just released StormCrawler 1.17. As you can see in the list below, this contains important bugfixes and improvements. For this reason, we recommend that all users upgrade to this version. However, please check the breaking changes below if you apply it to an existing crawl.

Dependency upgrades

Various dependency upgrades #808
CrawlerCommons 1.1 dependency #807
Tika 1.24.1 #797
Jackson-databind #803 #793 #798

Core

Use regular expressions for custom number of threads per queue fetcher #788
/!breaking!/ Prefix protocol metadata #789
Basic authentication for OKHTTP #792
Utility to debug / test parsefilters #794
/!breaking!/ Remove deprecated methods and fields #791
AdaptiveScheduler to set last-modified time in metadata #777 #812
/bugfix/ _fetch.exception_ key should be removed from metadata if subsequent fetches are successful #813
/bugfix/ SimpleFetcherBolt maxThrottleSleepMSec not deactivated #814
/!breaking!/ Index pages with content="noindex,follow" meta tag #750
Enable extension parsing for SitemapParser #749 #815

WARC

Implement WARC spout #755 #799

Elasticsearch

/bugfix/ AggregationSpout error due to SimpleDateFormat not being thread safe #809
/bugfix/ IndexerBolt issue causing ack failures #801
Allow ES to connect over a proxy #787

Of the breaking changes above, #789 is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add

protocol.md.prefix: ""

to the configuration. Similarly, http.skip.robots has been renamed to http.robots.file.skip.
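
For anyone who manages their configuration programmatically, the two changes above could be applied along the lines of the sketch below. It is only an illustration: the class and method names are hypothetical, the configuration is modelled as a plain Map, and it assumes that the value of the renamed key carries over unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of StormCrawler. */
final class ConfigMigrationSketch {

    /** Returns a copy of an existing crawl's configuration adjusted for 1.17. */
    static Map<String, Object> migrateTo117(Map<String, Object> oldConf) {
        Map<String, Object> conf = new LinkedHashMap<>(oldConf);
        // #789: an empty prefix keeps protocol metadata keys as they were
        // before 1.17, which is what an existing crawl expects.
        conf.putIfAbsent("protocol.md.prefix", "");
        // http.skip.robots was renamed to http.robots.file.skip.
        // Assumption: the old value can be carried over as-is.
        if (conf.containsKey("http.skip.robots")) {
            Object value = conf.remove("http.skip.robots");
            conf.putIfAbsent("http.robots.file.skip", value);
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Object> old = new LinkedHashMap<>();
        old.put("http.skip.robots", true);
        System.out.println(migrateTo117(old));
    }
}
```

The input map is left untouched, so the original configuration can still be compared with the migrated one.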

Thanks to all contributors and users! Happy crawling!

PS: something equally exciting is coming next ;-)

Thursday, 16 January 2020

What's new in StormCrawler 1.16?

Happy new year!

StormCrawler 1.16 was released a couple of days ago. You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/26?closed=1

As usual, we recommend that all users upgrade to this version as it contains important fixes and performance improvements.

Dependency upgrades

Tika 1.23 (#771)
ES 7.5.0 (#770)
jackson-databind from 2.9.9.2 to 2.9.10.1 (#767)

Core

OKHttp configure authentication for proxies (#751)
Make URLBuffer configurable + AbstractURLBuffer uses URLPartitioner (#754)
/bugfix/ okhttp protocol: reliably mark trimmed content because of content limit (#757)
/!breaking!/ urlbuffer code in a separate package + 2 new implementations (#764)
Crawl-delay handling: allow `fetcher.max.crawl.delay` exceed 300 sec.(#768)
okhttp protocol: HTTP request header lacks protocol name and version (#775)
Locking mechanism for Metadata objects (#781)

LangID

/bugfix/ langID parse filter gets stuck (#758)

Elasticsearch

/bugfix/ Fix NullPointerException in JSONResourceWrappers (#760)
ES specify field used for grouping the URLs explicitly in mapping (#761)
Use search after for pagination in HybridSpout (#762)
Filter queries in ES can be defined as lists (#765)
es.status.bucket.sort.field can take a list of values (#766)
Archetype for SC+Elasticsearch (#773)
ES merge seed injection into crawl topology (#778)
Kibana - change format of templates to ndjson (#780)
/bugfix/ HybridSpout get key for results when prefixed by "metadata." (#782)
AggregationSpout to store sortValues for the last result of each bucket (#783)
Import Kibana dashboards using the API (#785)
Include Kibana script and resources in ES archetype (#786)

One of the main improvements in 1.16 is the addition of a Maven archetype to generate a crawl topology using Elasticsearch as a backend (#773). This is done by calling

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST

The generated project also contains a script and resources to load templates into Kibana.

The topology for Elasticsearch now includes the injection of seeds from a file, which was previously in a separate topology. These changes should help beginners get started with StormCrawler.

The previous release included URLBuffers, with just one simple implementation. Two new implementations have been added in #764. The brand new PriorityURLBuffer sorts the buckets by the number of acks they got since the last sort whereas the SchedulingURLBuffer tries to guess when a queue should release a URL based on how long it took its previous URLs to be acked on average. The former has been used extensively with the HybridSpout but the latter is still experimental.
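
To give a feel for the idea behind the PriorityURLBuffer, here is a minimal sketch of buckets being re-ordered by the number of acks received since the previous sort. It is not the actual implementation: the class and method names are made up, and it assumes that buckets with more acks are served first, which the description above does not spell out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of ack-based ordering. Not the StormCrawler class. */
final class AckPrioritySketch {

    private final Map<String, Integer> acksSinceLastSort = new HashMap<>();
    private final List<String> queueOrder = new ArrayList<>();

    /** Registers a bucket (e.g. a hostname) if it is not known yet. */
    void addQueue(String key) {
        if (!queueOrder.contains(key)) {
            queueOrder.add(key);
        }
    }

    /** Records that a URL belonging to this bucket was acked. */
    void acked(String key) {
        acksSinceLastSort.merge(key, 1, Integer::sum);
    }

    /** Re-orders the buckets, most acked first (assumed), then resets the counters. */
    List<String> resort() {
        queueOrder.sort(
                Comparator.comparingInt((String k) -> acksSinceLastSort.getOrDefault(k, 0))
                        .reversed());
        acksSinceLastSort.clear();
        return new ArrayList<>(queueOrder);
    }

    public static void main(String[] args) {
        AckPrioritySketch buffer = new AckPrioritySketch();
        buffer.addQueue("a.example");
        buffer.addQueue("b.example");
        buffer.acked("b.example");
        System.out.println(buffer.resort());
    }
}
```

Because the sort is stable, buckets with the same number of acks keep their previous relative order.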

Finally, we added a soft locking mechanism to Metadata (#781) to help trace the source of ConcurrentModificationExceptions. If you are experiencing such exceptions, calling metadata.lock() when emitting e.g.

collector.emit(StatusStreamName, tuple, new Values(url, metadata.lock(), Status.FETCHED))

will trigger an exception whenever the metadata object is modified somewhere else. You might need to call unlock() in the subsequent bolts.

This does not change the way the Metadata works but is just there to help you debug.
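
The principle of a soft lock can be illustrated with the stand-alone sketch below. It is not the StormCrawler Metadata class: only lock() and unlock() come from the description above, while the class name, the put/get methods and the type of exception thrown are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a metadata object with a soft lock. */
final class LockableMetadataSketch {

    private final Map<String, String> values = new HashMap<>();
    private boolean locked = false;

    /** Marks the object as read-only and returns it, so it can be used inline when emitting. */
    LockableMetadataSketch lock() {
        locked = true;
        return this;
    }

    /** Makes the object writable again, e.g. in a subsequent bolt. */
    LockableMetadataSketch unlock() {
        locked = false;
        return this;
    }

    void put(String key, String value) {
        // Fail at the place where the unexpected modification happens,
        // which is what makes the culprit easy to find in a stack trace.
        if (locked) {
            throw new IllegalStateException("Attempt to modify locked metadata: " + key);
        }
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }

    public static void main(String[] args) {
        LockableMetadataSketch md = new LockableMetadataSketch();
        md.put("depth", "1");
        md.lock();
        try {
            md.put("depth", "2");
        } catch (IllegalStateException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

The lock is "soft" in the sense that nothing stops code from calling unlock(); it exists to surface the offending call during debugging, not to enforce immutability.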

Hopefully, we should be able to release 2.0 in the next few months. In the meantime, happy crawling and a massive thank you to all contributors!

Thursday, 19 September 2019

What's new in StormCrawler 1.15?

StormCrawler 1.15 was released yesterday and, as usual, contains loads of improvements and bugfixes.

You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/25?closed=1

We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

Storm 1.2.3 (#743)
JSOUP 1.12.1 (#741)
ES 7.3.0 (#742)
Tika 1.22 (#726)

Core

/bugfix/ CharsetIdentification crashes on binary content (#747)
FetcherBolt skips tuples which have spent too much time in queues (#746)
Fetcher bolts generate metrics for HTTP status (#745)
improvements to URLFilterBolt (#740)
/bugfix/ FetcherBolt doesn't recover when entering maxNumberURLsInQueues (#738)
/bugfix/ RemoteDriverProtocol does not set user agent correctly (#735)
Force English Locale for SimpleDateFormat in cookie converter (#732)

LangID

LangId normalises and returns value found via extraction (#733)

Elasticsearch

Pluggable URLBuffer and Hybrid Elasticsearch spout (#752)

ES spouts control how long the search is allowed to take with timeout (#753)

Improve types used for numeric values for metrics mappings (#744)

Use sniffer for ES connections (#734)

ScrollSpout to quit logging when finished (#727)

ES spouts use nextFetchDate RangeQuery as a filter (#725)

MetricsConsumer takes an optional date format (#724)

StatusMetricsBolt returns a max of 10K results per status (#723)

Happy crawling and thanks to all contributors!

Monday, 13 May 2019

What's new in StormCrawler 1.14

StormCrawler 1.14 was released yesterday and, as usual, contains loads of improvements and bugfixes.

You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/24?closed=1

This release contains a number of breaking changes, mostly related to the move to Elasticsearch 7. We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

crawler-commons 1.0 (#693)
okhttp 3.14.0 (#692)
guava 27.1 (#702)
icu4j 64.1 (#702)
httpclient 4.5.8 (#702)
Snakeyaml 1.24 (#702)
wiremock 2.22.0 (#702)
rometools 1.12.0 (#702)
Elasticsearch 7.0.0 (#708)

Core

Track how long a spout has been without any URLs in its buffer (#685)
Change ack mechanism for StatusUpdaterBolts (#689)
Robots URL filter to get instructions from cache only (#700)
Allow indexing under canonical URL if in the same domain, not just host (#703)
/bugfix/ URLs ending with a space are fetched over and over again (#704)
ParseFilter to normalise the mime-type of documents into simple values (#707)
Robot rules should check the cache in case of a redirection (#709)
/bugfix/ Fix the logic around sitemap = false (#710)
Reduce logging of exceptions in FetcherBolt (#719)

Elasticsearch

Asynchronous spouts (i.e. ES) can send queries after max delay since previous one ended (#683)
StatusUpdaterBolt to load config from non-default param names (#687)
Add a ScrollSpout to read all the documents from a shard (#688 and #690) - see in our guest post how this can be used to reindex a status index.
ES IndexerBolt : check success of batches before acking tuples (#647)
/bugfix/ URLs with content that breaks ES get refetched over and over again (#705)
/bugfix/ URLs without valid host name (and routing) stay DISCOVERED forever (#706)
/bugfix/ ESSeedInjector: no URLs injected because URL filter does not subscribe to status stream (#715)
MetricsConsumer to include topology ID in metrics (#714)

WARC

Generate WARC request records (#509)
WARC format improvements (#691)

Tika

Set mimetype whitelist for Tika Parser (#712)

*********

I will be running a workshop on StormCrawler next month at the Web Archiving Conference in Zagreb and give a presentation jointly with Sebastian Nagel of CommonCrawl. I will come with loads of presents generously given by our friends at Elastic.

As usual, thanks to all contributors and users.

Happy crawling!

Monday, 11 February 2019

Meet StormCrawler users: Q&A with Pixray (Germany)

We are opening a series of Q&A blogs with Maik Piel telling us about the use of StormCrawler at Pixray.

Q: What do you guys do at Pixray? Why do you need web crawling?

We are experts in image tracking on the web. We work for image rights holders to protect their pictures on the web as well as brands and manufacturers to monitor sales channels. Our customers range from news agencies and picture agencies, individual photographers, e-commerce companies to luxury brands. Web crawling is one of the core building blocks of our platform - next to a massive picture matching platform, various APIs and our customer portals.

Q: What sort of crawls do you do? How big are they?

We do three kinds of scans: broad scans across complete regions of the web (like the EU or North America), deep scans on single domains and also near-realtime discovery scans on thousands of selected domains. For all of these different scans, we employ customized versions of StormCrawler to match the very distinct requirements in crawling patterns. Obviously, the biggest crawls are the broad regional scans, including more than 10 billion URLs and tens of millions of different domains.

Q: What software stack do you use? e.g. SC + ES + Grafana? Hardware used?

Adapted and extended versions of StormCrawler as well as Elasticsearch and Kibana. We couple our crawling infrastructure with the rest of our platform through RabbitMQ. Our crawler runs on Ubuntu servers, each with 32 GB of RAM, an Intel Core i7 CPU and 4 TB of disk space. Each runs Apache Storm and Elasticsearch. In the future, we will split the storage (Elasticsearch) and the computation (Storm) layers to separate hardware. We are also looking at options to employ container and service orchestration frameworks to scale our crawler infrastructure dynamically.

Q: Why did you choose StormCrawler?

We initially built our crawler on Apache Nutch. Needless to say, Nutch is a great and robust platform. But once you grow beyond a certain point you start to see limitations. The biggest limitation is the low responsiveness to changes and the uneven system utilization due to the long generate/crawl/update cycles. It sometimes took us 24 hours or more until we could see the effects of a change we made to the software. Furthermore, we found that it is a bit troublesome to get valid statistics data from Nutch in real time. StormCrawler solves all that for us. Every config or code change that we commit shows its effect immediately and you get statistics very, very easily. There is no long-cycle batching anymore in StormCrawler, which gives us very even and continuous crawling and reduces our need for massive queuing of results to ensure an even utilization of downstream infrastructure. Kibana gives us great real-time insights into the crawl database. With Nutch, we had to run analysis jobs of around 4 hours, even if we just needed the status of a single URL.

Q: What do you like the most / least in StormCrawler?

Besides the points mentioned above, we have to praise StormCrawler's extensibility. In our different setups we have both made changes to existing code in the StormCrawler project and written large amounts of our own code. The structure Apache Storm imposes is great. Components are very cleanly decoupled and it is easy to introduce custom functionality by just writing new Spouts and Bolts and linking them into the topology. For our use case we, of course, had to deal with pictures - which StormCrawler itself does not do. We just created our own Bolts for that. For our near-realtime discovery crawler, we needed an engine that calculates the revisit date for a URL based on various factors instead of a static value; again, we could just create a specific spout for that.

Q: Anything in particular you'd like to have in a future release?

It would be great to have a built-in way to prioritize different TLDs within the StormCrawler spouts. We have built a custom solution for that which we might contribute back to StormCrawler at some point.

Sunday, 6 January 2019

What's new in StormCrawler 1.13

Happy new year!

I have just released StormCrawler 1.13, which contains important bug fixes and some nice improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades

Tika 1.20 (#676)

Xerces 2.12.0 (#672)

Guava 27.0.1 (#672)

Elasticsearch 6.5.3 (#672)

Jackson 2.8.11.3 (14e44)

Core

FileSpout uses StringTabScheme by default (#664)

JSoupParserBolt outlink limit per page (#670)

/BUGFIX/ Date format used for HTTP if-modified-since requests must follow RFC7231 (#674)

/BUGFIX/ DeletionBolt expects Metadata from tuples (#675)

Added configurable TextExtractor to JSoupParserBolt (#678)

!BREAKING! Core Spouts should use status stream if withDiscoveredStatus is set to true (#677)
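
As background to #674 in the list above: RFC 7231 prefers a fixed-length date format (IMF-fixdate), always in English and in GMT. The sketch below shows one way of producing such a date with java.time; it is an illustration, not the code used in StormCrawler, and the class name is hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical sketch: formats dates for an If-Modified-Since header. */
final class HttpDateSketch {

    // IMF-fixdate as defined in RFC 7231, section 7.1.1.1.
    // The English locale keeps day and month names independent of the
    // machine's default locale; the zone is fixed to UTC/GMT.
    private static final DateTimeFormatter IMF_FIXDATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    static String format(Instant instant) {
        return IMF_FIXDATE.format(instant);
    }

    public static void main(String[] args) {
        // The example date used in RFC 7231.
        System.out.println(format(Instant.parse("1994-11-06T08:49:37Z")));
        // prints "Sun, 06 Nov 1994 08:49:37 GMT"
    }
}
```

Unlike SimpleDateFormat, a DateTimeFormatter is immutable and thread-safe, so a single instance can be shared between threads.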

SQL

SQL IndexerBolt (#608)

Archetype

Archetype sets StormCrawler version in a property (#668)

Replace ContentFilter with TextExtractor (#678)

Apart from the changes to the core spouts (#664 and #677), the main new feature is the addition of the TextExtractor (#678) for the JSoupParserBolt. Unlike the ContentParseFilter, which it replaces, it is configured from the main configuration and is not a ParseFilter, as it operates directly on the objects generated by Jsoup. The TextExtractor allows restricting the text to specific elements in order to avoid boilerplate and navigation elements, and it provides far cleaner text content than the ContentParseFilter, which merges some tokens. The TextExtractor can also be used to define exclusion zones, which are applied either to the restricted zones or to the whole document if no such zone was defined or found. This is useful, for instance, to remove SCRIPT or STYLE elements.

As usual, thanks to all contributors and users, and particularly the Government of Northwest Territories in Canada who kindly donated some of the code of the TextExtractor.

Happy crawling!

Thursday, 22 November 2018

What's new in StormCrawler 1.12

The previous release was only last month but I decided to ship this one now as it contains several bugfixes and improvements which many users would benefit from.

As you can see below, the main changes are around protocols and sitemaps. We have used Selenium and OKHTTP a lot recently to deal with dynamic websites and the changes below definitely help for these. There is also an important bugfix for JSOUP (#653) and various other improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades

JSOUP 1.11.3 (#663)
Elasticsearch 6.5.0 (#661)
Jackson and Wiremock dependencies (#640)

Core

Post JSON data with OKHTTP protocol via metadata (#641)
Selenium RemoteDriverProtocol triggered by K/V in metadata (#642)
SeleniumProtocol NavigationFilters not reached in case of a redirection (#643)
Limit crawl to URLs found in sitemaps (#645)
spout.reset.fetchdate.after based on time when query was set to NOW (#648)
Avoid StackOverflowError when generating DocumentFragment from JSOUP (#653)
redirected sitemaps don't have isSitemap=true (#660)
Staggered scheduling of sitemap URLs (#657)
Scheduling -> round to the closest second, minute or hour (#654)
FetcherBolt don't add discovered sitemaps if the robots rules do not allow them (#662)

WARC

WARC record format: trailing zero byte causes WARC parser to fail (#652)

Elasticsearch

ES IndexerBolt track number of batch sent (#540)
Rename index "index" into "docs" (#649)
ES StatusMetricsBolt generate metrics for total number of docs (#651)

Coming next...

The release of Storm 2.0.0 has taken longer than expected, which is partly my fault as I reported a number of issues. These issues have now been fixed and hopefully, 2.0.0 will be out soon. As mentioned last month, there's a branch of StormCrawler which works on the Storm 2.x branch. Give it a try if you want to be on the cutting edge!

Finally, there will be a StormCrawler workshop in Vilnius next week. I am sure tickets are still available if you fancy a last minute trip to Lithuania.

As usual, thanks to all contributors and users. Happy crawling!

UPDATE

There were 2 bugs in release 1.12 which have been fixed in 1.12.1, see details on

https://github.com/DigitalPebble/storm-crawler/milestone/23?closed=1
