The initial release of crawler-commons is available from: http://code.google.com/p/

Doing the release was quite an interesting experience, as I'd never done one before. It was an opportunity to have a closer look at Ant and Maven, learn how to publish artefacts, use Nexus, etc., which I am sure will be useful at some point (Behemoth? GORA? Nutch?).
The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.
The current version contains resources for:
- parsing robots.txt files
- parsing sitemaps
- a URL analyzer that returns top-level domains
- a simple HttpFetcher
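To illustrate the kind of functionality the robots.txt component covers, here is a minimal, self-contained sketch of robots.txt handling written for this post. It is not the crawler-commons API: the class and method names below (`RobotsSketch`, `disallowedPaths`, `isAllowed`) are hypothetical, and the parsing is deliberately simplified (prefix matching only, no `Allow:` rules or crawl-delay).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of robots.txt handling, NOT the crawler-commons API:
// collect the Disallow rules that apply to a given user-agent, then test
// candidate paths against them with simple prefix matching.
public class RobotsSketch {

    // Returns the Disallow path prefixes that apply to the given agent,
    // honouring both exact user-agent matches and the "*" wildcard.
    static List<String> disallowedPaths(String robotsTxt, String agent) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');            // strip comments
            if (hash >= 0) line = line.substring(0, hash).trim();
            if (line.isEmpty()) continue;
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                String ua = line.substring("user-agent:".length()).trim();
                applies = ua.equals("*") || ua.equalsIgnoreCase(agent);
            } else if (applies && lower.startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) rules.add(path); // empty Disallow allows all
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n";
        List<String> rules = disallowedPaths(robots, "mycrawler");
        System.out.println(isAllowed(rules, "/index.html")); // prints true
        System.out.println(isAllowed(rules, "/private/a"));  // prints false
    }
}
```

A real implementation has to deal with far more than this (malformed files, `Allow:` directives, case handling, wildcards), which is exactly why a shared, well-tested component is worth having.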
This release is available on Sonatype's OSS Nexus repository [https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/] and should be available on Maven Central soon.
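For Maven users, pulling the artefact in should look something like the snippet below. The groupId and artifactId are read off the repository path above; the version placeholder is deliberate, so check the repository for the actual released version number.

```xml
<!-- Coordinates inferred from the repository path; replace VERSION
     with the version number published in the repository. -->
<dependency>
  <groupId>com.google.code.crawler-commons</groupId>
  <artifactId>crawler-commons</artifactId>
  <version>VERSION</version>
</dependency>
```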
Please send your questions, comments, or suggestions to http://groups.google.com/
Now that crawler-commons has been released, we can start using it from Nutch and Bixo (see https://issues.apache.org/jira/browse/NUTCH-1031).