Showing posts with label course. Show all posts

Wednesday, 29 March 2017

Full day workshop(s) on StormCrawler (+Elasticsearch and Kibana)


I will be running a full-day workshop on crawling with StormCrawler on 24 April in Berlin. See full details at https://endoctus.com/course/web-crawling-with-stormcrawler.

Please find the program below:

In this workshop, we will explore StormCrawler, a collection of resources for building low-latency, large-scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we'll put it to use straight away for a simple crawl before moving on to the deployed mode of Storm.

In the second part of the session, we will introduce metrics and index documents with Elasticsearch and Kibana, then dive into data extraction. Finally, we'll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.

This course will suit Java developers with an interest in big data, stream processing, web crawling and search. It will provide a practical introduction to Apache Storm, Elasticsearch and, of course, StormCrawler, and should not require advanced programming skills.

Duration : 2x3 hours 


PS: Do you follow DigitalPebble or StormCrawler on Twitter? Announcements and updates are made there (as well as all sorts of interesting news of course!) 

Monday, 29 July 2013

Nutch training course

We are planning to run a 2-day training course on Apache Nutch on 24-25 October 2013. It will take place in Bristol, UK (the exact venue will be announced later).

The course has been put on hold for now. Please do get in touch if you are interested and I will keep you updated as soon as we reach a sufficient number of attendees.

The course will cover pretty much everything about Nutch, from installation and configuration to writing custom resources, for both Nutch 1.x and 2.x. Students will learn best practices for running and managing a Nutch crawl.

Attendees should have some knowledge of Java and be comfortable executing basic commands with command-line tools. Some understanding of Hadoop is a plus but not a strict requirement. The course will include hands-on exercises: bring your laptop! Note that the demonstrations and exercises will be based on a Linux OS.

The program given here is an indication only and might change slightly. Feel free to suggest things that you'd like to learn during the course. 

Day 1 : NUTCH BASICS

  • Basic setup
  • Compilation and dependencies
  • Main concepts and operational steps
  • Nutch data structures
  • Parsing
  • Indexing
  • Scoring
  • Best practices for development and in production 

Day 2 : ADVANCED NUTCH

  • Plugin architecture
  • Politeness and performance
  • Metadata in Nutch
  • Advanced use cases
  • Introduction to Nutch 2.x

Please contact us at course@digitalpebble.com if you have a question or want to be kept informed of the next date for this course.