DigitalPebble's Blog: text classification


Saturday, 19 March 2011

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise :

strong background in NLP and Java
GATE, experience of writing plugins and PRs, excellent knowledge of JAPE
IE, Linked Data, Ontologies
statistical approaches and machine learning
large scale computing with Hadoop
knowledge of the following technologies / tools : Lucene, SOLR, NoSQL, Tika, UIMA, Mahout
good social and presentation skills
good spoken and written English, knowledge of other languages would be a plus
taste for challenges and problem solving

DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering.

More details on our activities can be found on our website. We would consider candidates working remotely with occasional travel to Bristol and to our clients in the UK and Europe. Being located in or near Bristol would be a plus.

This job is an opportunity to get involved in the growth of a small company, work on interesting projects and take part in various Apache related projects and events. Bristol is also a great place to live.

Please send your CV and cover letter before 15th April 2011 to job@digitalpebble.com

Best regards,

Julien Nioche

Wednesday, 27 October 2010

TextClassification plugin for GATE

Just to let you know that I've made public our TextClassification plugin for GATE [1]. As its name indicates, it is a GATE plugin which uses
our TextClassification API [2] to build a training corpus from
GATE documents or to classify GATE documents with an existing model.

This code has been around for some time but I did not find the time to make it
public until now. Since I regularly get emails asking me about it, I
thought it would be simpler to release it.

As usual, it is under the Apache Software License and will (maybe) get
better documentation soon. For those of you familiar with the TC
API, this should be quite straightforward.

[1] http://github.com/jnioche/TextClassificationPlugin
[2] https://code.google.com/p/textclassification/

Friday, 3 September 2010

Field-based Weighting Schemes for Text Classification

Our Text Classification API uses a representation of documents based on fields, a bit like in Lucene.

This is quite useful, as it lets us differentiate terms based on the field they are found in and treat them as different attributes (e.g. text_every, title_title, ...), and of course take the length of a field into account when computing the weight of a term.

In the example above, the attribute text_every would get a score of 0.2 (1/5) under the term frequency weighting scheme, since the field contains a total of 5 tokens. Without the field-based representation of the documents, we'd have a single attribute every (note that the text_ prefix has gone) with a score of 0.0714 (1/14).
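To make the arithmetic concrete, here is a minimal Python sketch of the two representations. It is not code from our Text Classification API: the function names and the sample document are made up for illustration, and the document was simply chosen so that its text field holds 5 of the 14 tokens.

```python
from collections import Counter

def field_frequency_weights(fields):
    """Term frequency per field: count divided by the length of that field."""
    weights = {}
    for name, tokens in fields.items():
        for term, count in Counter(tokens).items():
            # The attribute name is prefixed with the field name, e.g. text_every
            weights[f"{name}_{term}"] = count / len(tokens)
    return weights

def flat_frequency_weights(fields):
    """Term frequency when all fields are merged into a single bag of tokens."""
    tokens = [token for field_tokens in fields.values() for token in field_tokens]
    return {term: count / len(tokens) for term, count in Counter(tokens).items()}

# Hypothetical document: 4 + 5 + 5 = 14 tokens in total
doc = {
    "title": ["this", "is", "the", "title"],
    "text": ["every", "word", "in", "this", "text"],
    "keywords": ["some", "keywords", "for", "the", "page"],
}

by_field = field_frequency_weights(doc)
flat = flat_frequency_weights(doc)

print(by_field["text_every"])    # 0.2    (1/5)
print(round(flat["every"], 4))   # 0.0714 (1/14)
```

Note that the term "this" ends up as two separate attributes (title_this and text_this) in the field-based version, but as a single attribute in the flat one.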

Having this is great as it gives us more options for modelling the content of a document. Intuitively, we know that a term found in the title of a web page or in its description has a different status from the same term in the main text of the page. This does not mean that it would necessarily have a higher weight, as this is determined by the ML algorithm, but at least the algorithm has the possibility to treat such a term differently.

We recently pushed the logic one step further. Since we use the length of the fields to compute the weights of the terms, at least for the tfidf and frequency weighting schemes, we thought it could be interesting to specify the weighting scheme per field instead of using the same scheme for all the fields. For instance, using frequency or tfidf for the main content of a document makes sense, but we wouldn't want to penalize a term in the title or keywords fields because of their length: a term that occurs once among 10 keywords is just as good as if it were on its own. We can now specify that the field title must use e.g. the boolean weighting scheme, while the other fields keep the default scheme.
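The idea can be sketched in a few lines of Python. Again, this is only an illustration and not the API itself: the names (SCHEMES, weigh_document, field_schemes) are hypothetical, and the tfidf variant used here, tf * log(N / df), is just one common formulation that may differ from what the API computes.

```python
import math
from collections import Counter

# Hypothetical scheme functions: each maps (count, field length, idf) to a weight.
SCHEMES = {
    "boolean": lambda count, field_len, idf: 1.0,
    "frequency": lambda count, field_len, idf: count / field_len,
    "tfidf": lambda count, field_len, idf: (count / field_len) * idf,
}

def weigh_document(fields, default_scheme, field_schemes=None,
                   doc_freq=None, num_docs=1):
    """Weigh each field with its own scheme, falling back to the default one."""
    field_schemes = field_schemes or {}
    doc_freq = doc_freq or {}
    weights = {}
    for name, tokens in fields.items():
        scheme = SCHEMES[field_schemes.get(name, default_scheme)]
        for term, count in Counter(tokens).items():
            attribute = f"{name}_{term}"
            # Assumption: idf = log(N / df); unseen attributes count as df = 1
            idf = math.log(num_docs / doc_freq.get(attribute, 1))
            weights[attribute] = scheme(count, len(tokens), idf)
    return weights

# Hypothetical document with one title term and 10 keywords
doc = {
    "title": ["hadoop"],
    "keywords": ["hadoop", "nutch", "solr", "lucene", "tika",
                 "uima", "mahout", "gate", "nlp", "java"],
}

same_scheme = weigh_document(doc, default_scheme="frequency")
mixed = weigh_document(doc, default_scheme="frequency",
                       field_schemes={"keywords": "boolean"})

print(same_scheme["keywords_hadoop"])  # 0.1, penalized by the 10 keywords
print(mixed["keywords_hadoop"])        # 1.0, field length no longer matters
print(mixed["title_hadoop"])           # 1.0, frequency of 1 token out of 1
```

With the same scheme everywhere, the keyword is worth ten times less than the identical term on its own in the title; overriding the scheme for the keywords field removes that penalty.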

I ran a quick experiment on the dataset we use to classify pages as adult or not in Nutch. The label is binary and indicates whether or not a page is suitable for all audiences. A model is built from this dataset and used in a custom Nutch ParseFilter and IndexingFilter so that we can e.g. use a filter query in SOLR to restrict the search results to 'safe' pages.

I tried 3 different versions of the dataset :
[A] all the fields (content, title, description, keywords) use the frequency scheme
[B] all the fields use the tfidf scheme
[C] the frequency scheme is used by default but the field content uses tfidf

and got the following results with the K-fold cross validation provided by libLinear :

A=97.3564%
B=96.6150%
C=97.5131%

Interestingly, the best results were obtained by using a different scheme for the content. Using tfidf for all the fields gave the worst results.

It would be interesting to compare this with using a single field for the content and none of the other fields. There are a lot of other experiments that could be run, but at least we now have the possibility to do them with the API.