The course covers Nutch from installation and configuration to writing custom resources, for both Nutch 1.x and 2.x. Students will also learn best practices for running and managing a Nutch crawl.
Attendees should have some knowledge of Java and be comfortable running basic commands with command-line tools. Some understanding of Hadoop is a plus but not a strict requirement. The course includes hands-on exercises: bring your laptop! Note that the demonstrations and exercises will be based on a Linux OS.
The program given here is indicative only and might change slightly. Feel free to suggest topics that you'd like to cover during the course.
Day 1: NUTCH BASICS
- Basic setup
- Compilation and dependencies
- Main concepts and operational steps
- Nutch data structures
- Parsing
- Indexing
- Scoring
- Best practices for development and in production
Day 2: ADVANCED NUTCH
- Plugin architecture
- Politeness and performance
- Metadata in Nutch
- Advanced use cases
- Introduction to Nutch 2.x