DigitalPebble's Blog: course

Wednesday, 29 March 2017

Full day workshop(s) on StormCrawler (+Elasticsearch and Kibana)

I will be running a full-day workshop on crawling with StormCrawler on the 24th April in Berlin. See full details on https://endoctus.com/course/web-crawling-with-stormcrawler.

Please find the program below:

In this workshop, we will explore StormCrawler a collection of resources for building low-latency, large scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what Storm-Crawler provides, we'll put it to use straight away for a simple crawl before moving on to the deployed mode of Storm.

In the second part of the session, we will then introduce metrics and index documents with Elasticsearch and Kibana and dive into data extraction. Finally, we'll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.

This course will suit Java developers with an interest in big data, stream processing, web crawling and search. It will provide a practical introduction to both Apache Storm and Elasticsearch as well of course as StormCrawler and should not require advanced programming skills.

Duration : 2x3 hours

PS: Do you follow DigitalPebble or StormCrawler on Twitter? Announcements and updates are made there (as well as all sorts of interesting news of course!)

Monday, 29 July 2013

Nutch training course

~~We are planning to run a 2-day training courses on Apache Nutch on the 24/25 October 2013. It will take place in Bristol, UK (the exact venue will be announced later).~~

The course has been put on hold for now. Please do get in touch if you are interested and I will keep you updated as soon as we reach a sufficient number of attendees.

The course will cover pretty much everything about Nutch from installation and configuration to writing custom resources and will cover both Nutch 1.x and 2.x. The students will learn about best practices for running and managing a Nutch crawl.

Attendees should have some knowledge of JAVA and be comfortable with command line tools to execute basic commands. Some understanding of Hadoop is a plus but not a strict requirement. The course will consist in some hands-on exercises : bring your laptop! Note that the demonstrations and exercises will be based on a Linux OS.

The program given here is an indication only and might change slightly. Feel free to suggest things that you'd like to learn during the course.

Day 1 : NUTCH BASICS

Basic setup
Compilation and dependencies
Main concepts and operational steps
Nutch data structures
Parsing
Indexing
Scoring
Best practices for development and in production

Day 2 : ADVANCED NUTCH

Plugin architecture
Politeness and performance
Metadata in Nutch
Advanced use cases
Introduction to Nutch 2.x

Please contact us on course@digitalpebble.com if you have a question or want to be kept informed of the next date for this course.