Saturday, 28 August 2010

Behemoth talk from BerlinBuzzwords 2010

The talk I gave on Behemoth at BerlinBuzzwords has been filmed (I do not dare watch it) and is available at http://blip.tv/file/3809855.

The slides can be found on http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp

The talk contains a quick demo of GATE and mentions Tika, UIMA and of course Hadoop.

Friday, 27 August 2010

Tom White on Hadoop 0.21

An excellent summary by Tom White of the Hadoop 0.21 release:

http://www.cloudera.com/blog/2010/08/what%e2%80%99s-new-in-apache-hadoop-0-21/

Having the distributed cache and parallel mappers available with the LocalJobRunner is very good news for Behemoth, as we rely on the distributed cache to ship resources to all the nodes. This should make it easier to test in local mode.
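As a rough sketch of what that means in practice (the resource path below is made up, and this assumes the classic org.apache.hadoop.filecache.DistributedCache API), a Behemoth-style job can ship a resource to every node like this:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class ShipResourceExample {

    /** Job setup side : register a resource so that it gets copied to every node. */
    public static void registerResource(Configuration conf) throws URISyntaxException {
        // e.g. a GATE application or UIMA descriptor used by the Behemoth mappers
        DistributedCache.addCacheFile(new URI("/resources/gate-app.zip"), conf);
    }

    /** Task side : list the local copies of the cached resources. */
    public static Path[] localCopies(Configuration conf) throws IOException {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}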

Thursday, 26 August 2010

Using Payloads with DisMaxQParser in SOLR

Payloads are a good way of controlling the scores in SOLR/Lucene.

This post by Grant Ingersoll gives a good introduction to payloads; I also found http://www.ultramagnus.org/?p=1 pretty useful.

What I will describe here is how to use payloads while keeping the functionality of the DisMaxQParser in SOLR.

SOLR already has a field type for analysing payloads 

<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--
      The DelimitedPayloadTokenFilter can put payloads on tokens... for example,
      a token of "foo|1.4" would be indexed as "foo" with a payload of 1.4f.
      Attributes of the DelimitedPayloadTokenFilterFactory:
        "delimiter" - a one character delimiter. Default is | (pipe)
        "encoder"   - how to encode the following value into a payload
           float    -> org.apache.lucene.analysis.payloads.FloatEncoder
           integer  -> o.a.l.a.p.IntegerEncoder
           identity -> o.a.l.a.p.IdentityEncoder
           or a fully qualified class name implementing PayloadEncoder;
           the encoder must have a no-arg constructor.
    -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldtype>
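A field can then be declared with this type and fed whitespace-separated token|weight pairs at indexing time. As a small illustration (the field name "payloads", the document and the id field below are made up for this example):

<field name="payloads" type="payloads" indexed="true" stored="false"/>

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="payloads">hadoop|2.5 lucene|1.2 tika|0.8</field>
  </doc>
</add>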



and we can also define a custom Similarity to use with the payloads

package com.digitalpebble.solr;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity {

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        // use the float encoded in the payload as the score contribution
        if (length > 0) {
            return PayloadHelper.decodeFloat(payload, offset);
        }
        // no payload on this term : neutral score
        return 1.0f;
    }
}


 
then specify this in the SOLR schema

<!-- schema.xml -->
<similarity class="com.digitalpebble.solr.PayloadSimilarity" />


 
So far so good. We now need a QueryParser plugin in order to use the payloads at search time and, as mentioned above, I want to keep the functionality of the DisMaxQParser.
The problem is that we need to build PayloadTermQuery objects instead of TermQueries, which happens deep down in the object hierarchy and cannot, AFAIK, be changed easily from the DisMaxQParser.
I have therefore implemented a modified version of the DisMaxQParser which rewrites the main part of the query (a.k.a. userQuery in the implementation) and substitutes the TermQueries with PayloadTermQueries.
First we'll create a QParserPlugin:

package com.digitalpebble.solr;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class PLDisMaxQParserPlugin extends QParserPlugin {

    public void init(NamedList args) {
    }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        return new PLDisMaxQParser(qstr, localParams, params, req);
    }
}



which does not do much apart from exposing the PLDisMaxQParser, a modified version of the standard DisMaxQParser that produces payload-aware query objects.


package com.digitalpebble.solr;

import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.payloads.MaxPayloadFunction;
import org.apache.lucene.search.payloads.PayloadFunction;
import org.apache.lucene.search.payloads.PayloadNearQuery;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.solr.common.params.DisMaxParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DisMaxQParser;
import org.apache.solr.util.SolrPluginUtils;

/**
 * Modified query parser for dismax queries which uses payloads
 */
public class PLDisMaxQParser extends DisMaxQParser {

    public static final String PAYLOAD_FIELDS_PARAM_NAME = "plf";

    public PLDisMaxQParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
    }

    protected HashSet<String> payloadFields = new HashSet<String>();

    private static final PayloadFunction func = new MaxPayloadFunction();

    float tiebreaker = 0f;

    protected void addMainQuery(BooleanQuery query, SolrParams solrParams)
            throws ParseException {
        Map<String, Float> phraseFields = SolrPluginUtils
                .parseFieldBoosts(solrParams.getParams(DisMaxParams.PF));
        tiebreaker = solrParams.getFloat(DisMaxParams.TIE, 0.0f);

        // get the comma separated list of fields used for payload
        String[] plfarray = solrParams.get(PAYLOAD_FIELDS_PARAM_NAME, "")
                .split(",");
        for (String plf : plfarray)
            payloadFields.add(plf.trim());

        /*
         * a parser for dealing with user input, which will convert things to
         * DisjunctionMaxQueries
         */
        SolrPluginUtils.DisjunctionMaxQueryParser up = getParser(queryFields,
                DisMaxParams.QS, solrParams, tiebreaker);

        /* for parsing sloppy phrases using DisjunctionMaxQueries */
        SolrPluginUtils.DisjunctionMaxQueryParser pp = getParser(phraseFields,
                DisMaxParams.PS, solrParams, tiebreaker);

        /* * * Main User Query * * */
        parsedUserQuery = null;
        String userQuery = getString();
        altUserQuery = null;
        if (userQuery == null || userQuery.trim().length() < 1) {
            // If no query is specified, we may have an alternate
            altUserQuery = getAlternateUserQuery(solrParams);
            query.add(altUserQuery, BooleanClause.Occur.MUST);
        } else {
            // There is a valid query string
            userQuery = SolrPluginUtils.partialEscape(
                    SolrPluginUtils.stripUnbalancedQuotes(userQuery))
                    .toString();
            userQuery = SolrPluginUtils.stripIllegalOperators(userQuery)
                    .toString();
            parsedUserQuery = getUserQuery(userQuery, up, solrParams);
            // recursively rewrite the elements of the query
            Query payloadedUserQuery = rewriteQueriesAsPLQueries(parsedUserQuery);
            query.add(payloadedUserQuery, BooleanClause.Occur.MUST);
            Query phrase = getPhraseQuery(userQuery, pp);
            if (null != phrase) {
                query.add(phrase, BooleanClause.Occur.SHOULD);
            }
        }
    }

    /** Substitutes original query objects with payload ones **/
    private Query rewriteQueriesAsPLQueries(Query input) {
        Query output = input;
        // rewrite TermQueries
        if (input instanceof TermQuery) {
            Term term = ((TermQuery) input).getTerm();
            // check that this is done on a field that has payloads
            if (payloadFields.contains(term.field()) == false)
                return input;
            output = new PayloadTermQuery(term, func);
        }
        // rewrite PhraseQueries
        else if (input instanceof PhraseQuery) {
            PhraseQuery pin = (PhraseQuery) input;
            Term[] terms = pin.getTerms();
            int slop = pin.getSlop();
            boolean inorder = false;
            // check that this is done on a field that has payloads
            if (terms.length > 0
                    && payloadFields.contains(terms[0].field()) == false)
                return input;
            SpanQuery[] clauses = new SpanQuery[terms.length];
            // phrase queries : keep the default function i.e. average
            for (int i = 0; i < terms.length; i++)
                clauses[i] = new PayloadTermQuery(terms[i], func);
            output = new PayloadNearQuery(clauses, slop, inorder);
        }
        // recursively rewrite DJMQs
        else if (input instanceof DisjunctionMaxQuery) {
            DisjunctionMaxQuery s = ((DisjunctionMaxQuery) input);
            DisjunctionMaxQuery t = new DisjunctionMaxQuery(tiebreaker);
            Iterator<Query> disjunctsiterator = s.iterator();
            while (disjunctsiterator.hasNext()) {
                Query rewrittenQuery = rewriteQueriesAsPLQueries(disjunctsiterator
                        .next());
                t.add(rewrittenQuery);
            }
            output = t;
        }
        // recursively rewrite BooleanQueries
        else if (input instanceof BooleanQuery) {
            for (BooleanClause clause : (List<BooleanClause>) ((BooleanQuery) input)
                    .clauses()) {
                Query rewrittenQuery = rewriteQueriesAsPLQueries(clause
                        .getQuery());
                clause.setQuery(rewrittenQuery);
            }
        }
        output.setBoost(input.getBoost());
        return output;
    }

    public void addDebugInfo(NamedList<Object> debugInfo) {
        super.addDebugInfo(debugInfo);
        if (this.payloadFields.size() > 0) {
            Iterator<String> iter = this.payloadFields.iterator();
            while (iter.hasNext())
                debugInfo.add("payloadField", iter.next());
        }
    }
}


Once these three classes have been compiled, jarred and put on the classpath of SOLR, we must add

<queryParser name="payload" class="com.digitalpebble.solr.PLDisMaxQParserPlugin" />


to solrconfig.xml, then specify for the requestHandler:

<str name="defType">payload</str>

<!-- plf : comma separated list of field names -->
<str name="plf">
  payloads
</str>
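
Putting it together, the relevant part of a request handler definition could look like the following sketch; the handler name and the qf value are only an illustration and assume the payloads field shown earlier:

<requestHandler name="/payload" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">payload</str>
    <str name="qf">payloads</str>
    <!-- plf : comma separated list of field names -->
    <str name="plf">payloads</str>
  </lst>
</requestHandler>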
 
The fields listed in the plf parameter will be queried with payload query objects. Remember that you can use &debugQuery=true to get the details of the scores and check that the payloads are being used.
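
For instance, something along these lines (host, port and query term are just an illustration):

http://localhost:8983/solr/select?q=hadoop&defType=payload&qf=payloads&plf=payloads&debugQuery=true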
 
 

Thursday, 19 August 2010

Tika on FeatherCast

Apache Tika recently split off from the Lucene project and became a separate top-level Apache project. Chris Mattmann talks about what Tika is and where it is going at http://feathercast.org/?p=90

Friday, 13 August 2010

Towards Nutch 2.0

Never mind the dodgy look of the blog - I'll improve that later!

For my first post, I'd like to mention the progress we've made recently towards Apache Nutch 2.0. It is based on a branch named NutchBase, developed mainly by Doğacan Güney, which is now in the trunk of the SVN repository. One of the main aspects of Nutch 2.0 is that it now stores its data in a datastore rather than in Hadoop's file-based structures. Note that we still get the distribution and replication of the data over a whole cluster and data locality for MapReduce, but we can also insert or modify a single entry in the table without having to read and rewrite the whole data structure, as was the case before.

Nutch uses a project named GORA as an intermediary between our code and the backend storage. There is a lot that could be said about GORA but, in short, what we are trying to achieve with it is a sort of common API for NoSQL stores. GORA already has implementations for HBase and Cassandra, but also for SQL. The plan is to put GORA in the Apache Incubator, or possibly make it a subproject of an existing Apache project (Hadoop? HBase? Cassandra?). We'll see how it goes.

There are quite a few structural changes in Nutch, most notably the fact that there aren't any segments any more, as all the information about a URL (metadata, original content, extracted text, ...) is stored in a single table, which means, for instance, no more segments to merge or metadata to move back to the crawldb. It's all in one place!

There are other substantial changes in 2.0, notably the removal of the Lucene-based indexing and search as we now rely on SOLR. Other indexing backends might be added later. Another step towards delegating functionalities to external projects is the increased use of Apache Tika for the parsing. We've removed quite a few legacy parsers from Nutch and let Tika do the work for us. We've also revamped the organisation of the code and done a lot of code clean-up.

Nutch 2.0 is still at an early stage and we are actively working on it, testing, debugging, etc. The good news is that it is not only an architectural change but also a basis for a whole lot of new functionality (see for instance https://issues.apache.org/jira/browse/NUTCH-882).

I'll keep you posted on our progress. As usual: give it a try, get involved, join us...