It's always nice to see clients emerging from stealth mode and showing the fruits of their labour to the public. Our friends at http://www.similarpages.com have just done so, and I am doubly pleased as this also reflects the work that DigitalPebble did for them.
SimilarPages is an add-on for Firefox which allows you to discover pages which are similar to the one you are currently reading. It's pretty cool and surprisingly easy to use. It's also completely free which is always nice.
From a technical point of view, we helped SimilarPages adapt Nutch to their needs and deploy it on a 400-node cluster on Amazon EC2 to crawl the web. We fetched and parsed a total of more than 3 billion pages, from which we obtained 200+ million lists of similarities. The crawlDB itself contained more than 10 billion URLs.
Operating at such a scale is definitely challenging and has been a great experience. From a Nutch point of view, quite a few improvements and bugfixes in Nutch 1.1 come directly from the work done at SimilarPages, so thanks for that, guys!
The SimilarPages use case is actually a good example of using Nutch as a crawling platform only, i.e. not indexing the documents with Lucene or SOLR. See A. Białecki's presentation at BerlinBuzzwords 2010 for more examples on this subject.
The computation of the similarities between URLs from the Nutch crawls is made using bespoke Hadoop Map-Reduce code. More details on their approach can be found on SimilarPages' website.
I feel quite proud to have contributed to this project and wish SimilarPages a long life. If you haven't done so, give it a try!
Monday 27 September 2010
Apache Nutch 1.2 released
[quoting the announcement by Chris Mattmann]
The Apache Nutch project is pleased to announce the release of Apache Nutch
1.2. The release contents have been pushed out to the main Apache release
site so the releases should be available as soon as the mirrors get the
syncs.
Apache Nutch, one of the six new Apache TLPs as a result of the April 2010
Board Meeting, is an extensible framework for building out large-scale
web-based search. Layered on top of fellow Apache projects Hadoop,
Lucene/Solr, and Tika, Nutch provides an out of the box platform for
fetching web pages, pdf files, word documents, and more. Nutch parses the
content and its relevant information, indexes its metadata, and makes it
available for efficient query and retrieval over modern Internet protocols.
Apache Nutch 1.2 contains a number of improvements and bug fixes. Details
can be found in the changes file:
http://www.apache.org/dist/nutch/CHANGES-1.2.txt
Apache Nutch is available in source and binary form from the following
download page: http://www.apache.org/dyn/closer.cgi/nutch/
In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the downloads
using signatures found on the Apache site:
http://www.apache.org/dist/nutch/KEYS-1.2.txt
For more information on Apache Nutch, visit the project home page:
http://nutch.apache.org
Labels:
nutch
Friday 3 September 2010
Field-based Weighting Schemes for Text Classification
Our Text Classification API uses a representation of documents based on fields, a bit like in Lucene.
This is quite useful as it lets us differentiate the terms based on the field they are found in and treat them as different attributes (e.g. text_every, title_title, ...) and of course take the length of a field into account when computing the weight of a term.
In the example above, the attribute text_every would get a score of 0.2 (1/5) if the term frequency was used as a weighting scheme as the field contains a total of 5 tokens. Without the field-based representation of the documents, we'd have an attribute every (note that text_ has gone) with a score of 0.0714 (1 / 14).
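To make the arithmetic concrete, here is a minimal Java sketch of the two representations. It is not the code of our API: the class name, the method names and the sample document (a 9-token title plus a 5-token text, 14 tokens in total) are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Text Classification API.
class FieldTermFrequency {

    // Field-based representation: attributes are named <field>_<term>
    // and weighted by (occurrences in the field) / (length of the field).
    static Map<String, Double> perField(Map<String, String[]> fields) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> field : fields.entrySet()) {
            String[] tokens = field.getValue();
            for (String token : tokens) {
                String attribute = field.getKey() + "_" + token;
                weights.merge(attribute, 1.0 / tokens.length, Double::sum);
            }
        }
        return weights;
    }

    // Flat representation: attributes are bare terms, weighted by
    // (occurrences in the document) / (length of the whole document).
    static Map<String, Double> flat(Map<String, String[]> fields) {
        int total = 0;
        for (String[] tokens : fields.values()) {
            total += tokens.length;
        }
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String[] tokens : fields.values()) {
            for (String token : tokens) {
                weights.merge(token, 1.0 / total, Double::sum);
            }
        }
        return weights;
    }

    // Made-up document: 9 tokens in the title, 5 in the text.
    static Map<String, String[]> sampleDocument() {
        Map<String, String[]> doc = new LinkedHashMap<>();
        doc.put("title", "the title of this page is rather long indeed".split(" "));
        doc.put("text", "every page has a text".split(" "));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String[]> doc = sampleDocument();
        System.out.println("text_every = " + perField(doc).get("text_every"));
        System.out.println("every      = " + flat(doc).get("every"));
    }
}
```

With this sample document, text_every is weighted 1/5 in the field-based representation whereas every is weighted 1/14 in the flat one.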
Having this is great as it gives us more options for modelling the content of a document. Intuitively, we know that a term found in the title of a web page or in its description has a different status than in the main text of the page. This does not mean that it would necessarily have a higher weight, as this is determined by the ML algorithm, but at least the algorithm has the possibility to treat such a term differently.
We recently pushed the logic one step further. Since we use the length of the fields to compute the weights of the terms, at least for the tfidf and frequency weighting schemes, we thought it could be interesting to specify the weighting scheme per field instead of using the same scheme for all the fields. For instance, using frequency or tfidf for the main content of a document makes sense, but we wouldn't want to penalize a term in the title or keywords fields because of their length: a term occurring once among 10 keywords is just as good as if it were on its own. We can now specify that the field title must use e.g. the boolean weighting scheme but keep the default scheme for the other fields.
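The idea can be sketched in a few lines of Java. This is only an illustration under my own assumptions, not the actual API: the class, the enum and the `use` method are hypothetical names, and tfidf is left out because it needs corpus-wide document frequencies.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-field weighting; names are not those of the real API.
class PerFieldWeighting {

    enum Scheme { BOOLEAN, FREQUENCY }

    private final Scheme defaultScheme;
    private final Map<String, Scheme> overrides = new HashMap<>();

    PerFieldWeighting(Scheme defaultScheme) {
        this.defaultScheme = defaultScheme;
    }

    // Registers a scheme for one field; other fields keep the default.
    PerFieldWeighting use(String field, Scheme scheme) {
        overrides.put(field, scheme);
        return this;
    }

    // Returns <field>_<term> attributes weighted with the scheme of their field.
    Map<String, Double> weigh(Map<String, String[]> fields) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> field : fields.entrySet()) {
            Scheme scheme = overrides.getOrDefault(field.getKey(), defaultScheme);
            String[] tokens = field.getValue();
            for (String token : tokens) {
                String attribute = field.getKey() + "_" + token;
                if (scheme == Scheme.BOOLEAN) {
                    // Presence only: the length of the field is ignored.
                    weights.put(attribute, 1.0);
                } else {
                    // Term frequency normalised by the length of the field.
                    weights.merge(attribute, 1.0 / tokens.length, Double::sum);
                }
            }
        }
        return weights;
    }
}
```

Calling `new PerFieldWeighting(Scheme.FREQUENCY).use("keywords", Scheme.BOOLEAN)` would give a term found once among 10 keywords a weight of 1 instead of 0.1, while the terms of the other fields are still normalised by the length of their field.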
I ran a quick experiment on the dataset we use to classify pages as adult or not in Nutch. The label is binary and indicates whether a page is suitable for all audiences or not. A model is built from this dataset and used in a custom Nutch ParseFilter and IndexingFilter so that we can e.g. use a filter query in SOLR to restrict the search results to 'safe' pages.
I tried 3 different versions of the dataset:
[A] all the fields (content title description keywords) use the frequency scheme
[B] all the fields use the tfidf scheme
[C] the frequency scheme is used by default but the field content uses tfidf
and got the following results with the K-fold cross validation provided by libLinear:
A=97.3564%
B=96.6150%
C=97.5131%
Interestingly, the best results were obtained by using a different scheme for the content. Using tfidf for all the fields gave the worst results.
It would be interesting to compare this with using a single field for the content and none of the other fields. There are a lot of other experiments that could be run, but at least we now have the possibility to do them with the API.
Labels:
text classification
Saturday 28 August 2010
Behemoth talk from BerlinBuzzwords 2010
The talk I gave on Behemoth at BerlinBuzzwords was filmed (I do not dare watch it) and is available at http://blip.tv/file/3809855.
The slides can be found on http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp
The talk contains a quick demo of GATE and mentions Tika, UIMA and of course Hadoop.
Friday 27 August 2010
Tom White on Hadoop 0.21
An excellent summary from Tom White of the Hadoop 0.21 release:
http://www.cloudera.com/blog/2010/08/what%e2%80%99s-new-in-apache-hadoop-0-21/
Having the distributed cache and parallel mappers with the LocalJobRunner is very good news for Behemoth as we need it to distribute the resources to all the nodes. This should make it easier to test in local mode.
Thursday 26 August 2010
Using Payloads with DisMaxQParser in SOLR
Payloads are a good way of controlling the scores in SOLR/Lucene.
This post by Grant Ingersoll gives a good introduction to payloads; I also found http://www.ultramagnus.org/?p=1 pretty useful.
What I will describe here is how to use payloads while keeping the functionality of the DisMaxQParser in SOLR.
SOLR already has a field type for analysing payloads
and we can also define a custom Similarity to use with the payloads
then specify this in the SOLR schema
<!-- schema.xml -->
<similarity class="uk.org.company.solr.PayloadSimilarity" />
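For readers who have not used payloads before, here is a self-contained Java sketch of the principle. It does not use the Lucene or SOLR classes and is not the PayloadSimilarity referenced in the schema: it assumes tokens written as term|weight (the `|` delimiter is an assumption), stores the weight as a 4-byte float and shows how a payload-aware similarity could turn those bytes back into a scoring factor. All names are hypothetical.

```java
import java.nio.ByteBuffer;

// Stand-in for illustration only; it does not depend on Lucene or SOLR.
class FloatPayloadSketch {

    // Index time: splits "term|weight" and returns the weight as a
    // 4-byte big-endian float. Tokens without a delimiter get 1.0.
    static byte[] encode(String token, char delimiter) {
        int pos = token.lastIndexOf(delimiter);
        float weight = pos < 0 ? 1.0f : Float.parseFloat(token.substring(pos + 1));
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    // Search time: decodes the payload and returns it as a multiplier
    // for the score of the term. A missing payload is neutral.
    static float scorePayload(byte[] payload) {
        if (payload == null || payload.length < 4) {
            return 1.0f;
        }
        return ByteBuffer.wrap(payload).getFloat();
    }

    public static void main(String[] args) {
        byte[] payload = encode("nutch|2.5", '|');
        System.out.println("factor = " + scorePayload(payload));
    }
}
```

In a real setup the decoding would live in the method of the custom Similarity that scores payloads, and the encoding would be handled by the analyser of the payload field type.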
So far so good. We now need a QueryParser plugin in order to use the payloads in the search and as mentioned above, I want to keep the functionalities of the DisMaxQueryParser.
The problem is that we need to specify PayloadTermQuery objects instead of TermQueries. These are created deep down in the object hierarchy and cannot, AFAIK, be modified simply from DismaxQueryParser. I have implemented a modified version of DismaxQueryParser which rewrites the main part of the query (a.k.a. userQuery in the implementation) and substitutes the TermQueries with PayloadTermQueries.
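The substitution itself boils down to a recursive walk over the query tree. The sketch below shows the principle with minimal stand-in classes so that it runs on its own; they only mimic the shape of the Lucene query classes and are not the real ones, and the actual PLDisMaxQueryParser is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Minimal stand-ins for the Lucene query classes, for illustration only.
class QueryRewriteSketch {

    interface Query {}

    static class TermQuery implements Query {
        final String field;
        final String text;
        TermQuery(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    static class PayloadTermQuery extends TermQuery {
        PayloadTermQuery(String field, String text) {
            super(field, text);
        }
    }

    // Stands for any query made of sub-queries (boolean, disjunction max, ...).
    static class CompositeQuery implements Query {
        final List<Query> clauses = new ArrayList<>();
        CompositeQuery(Query... clauses) {
            this.clauses.addAll(Arrays.asList(clauses));
        }
    }

    // Walks the tree and replaces the term queries on payload fields.
    static Query rewrite(Query query, Set<String> payloadFields) {
        if (query instanceof CompositeQuery) {
            CompositeQuery composite = (CompositeQuery) query;
            for (int i = 0; i < composite.clauses.size(); i++) {
                composite.clauses.set(i, rewrite(composite.clauses.get(i), payloadFields));
            }
            return composite;
        }
        if (query instanceof TermQuery && !(query instanceof PayloadTermQuery)) {
            TermQuery term = (TermQuery) query;
            if (payloadFields.contains(term.field)) {
                return new PayloadTermQuery(term.field, term.text);
            }
        }
        // Anything else (or a term on a non-payload field) is left untouched.
        return query;
    }
}
```

Only the terms whose field is in the set of payload fields are swapped, so the rest of the query built by the parser keeps its usual behaviour.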
First we'll create a QParserPlugin
which does not do much but simply exposes the PLDisMaxQueryParser which is a modified version of the standard DisMaxQueryParser but with PayloadQuery objects.
Once these 3 classes have been compiled, jarred and put in the classpath of SOLR, we must add
to solrconfig.xml.
then specify for the requestHandler :
<str name="defType">payload</str>
<!-- plf : comma separated list of field names -->
<str name="plf"> payloads </str>
The fields listed in the parameter plf will be queried with Payload query objects. Remember that you can use &debugQuery=true to get the details of the scores and check that the payloads are being used.
Thursday 19 August 2010
Tika on FeatherCast
Apache Tika recently split off from the Lucene project and became a separate top-level Apache project. Chris Mattmann talks about what Tika is and where it's going at http://feathercast.org/?p=90
Labels:
tika