The talk I gave on Behemoth at BerlinBuzzwords has been filmed (I do not dare watch it) and is available at http://blip.tv/file/3809855.
The slides can be found at http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp
The talk contains a quick demo of GATE and mentions Tika, UIMA and of course Hadoop.
Saturday, 28 August 2010
Friday, 27 August 2010
Tom White on Hadoop 0.21
An excellent summary from Tom White on the 0.21 release of Hadoop:
http://www.cloudera.com/blog/2010/08/what%e2%80%99s-new-in-apache-hadoop-0-21/
Having the distributed cache and parallel mappers with the LocalJobRunner is very good news for Behemoth as we need it to distribute the resources to all the nodes. This should make it easier to test in local mode.
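As a reminder of how the distributed cache is used (a generic sketch, not Behemoth code; the HDFS path and resource name are made up), a resource is registered on the job configuration at submission time and each node then reads its local copy:

// Generic sketch only: the path below is made up for illustration.
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class CacheSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // ship a resource (e.g. a GATE or UIMA application) to every node
        DistributedCache.addCacheFile(new URI("/resources/gate-app.zip"), conf);
        // ... submit the job with this configuration ...
    }

    // from within a Mapper, the local copies can be listed with:
    static Path[] localCopies(Configuration conf) throws IOException {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}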
Thursday, 26 August 2010
Using Payloads with DisMaxQParser in SOLR
Payloads are a good way of controlling the scores in SOLR/Lucene.
This post by Grant Ingersoll gives a good introduction to payloads; I also found http://www.ultramagnus.org/?p=1 pretty useful.
What I will describe here is how to use payloads while keeping the functionality of the DisMaxQParser in SOLR.
SOLR already has a field type for analysing payloads:
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--
    The DelimitedPayloadTokenFilter can put payloads on tokens... for example,
    a token of "foo|1.4" would be indexed as "foo" with a payload of 1.4f.
    Attributes of the DelimitedPayloadTokenFilterFactory:
      "delimiter" - a one-character delimiter. Default is | (pipe)
      "encoder" - how to encode the following value into a payload
        float -> org.apache.lucene.analysis.payloads.FloatEncoder,
        integer -> o.a.l.a.p.IntegerEncoder
        identity -> o.a.l.a.p.IdentityEncoder
        or a fully qualified class name implementing PayloadEncoder; the Encoder must have a no-arg constructor.
    -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldtype>
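To illustrate how the tokens are expected to look (this example document is my own addition, not from the original post), each whitespace-separated token carries its weight after the pipe delimiter when a document is indexed into a field of this type:

<add>
  <doc>
    <field name="id">doc1</field>
    <!-- each token carries its own float payload -->
    <field name="payloads">apache|2.5 nutch|0.8 tika|1.2</field>
  </doc>
</add>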
and we can also define a custom Similarity to use with the payloads:
package com.digitalpebble.solr;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity {

    // Use the payload value as the score contribution for a match;
    // fall back to 1.0 when a token has no payload.
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        if (length > 0) {
            return PayloadHelper.decodeFloat(payload, offset);
        }
        return 1.0f;
    }
}
Then specify this in the SOLR schema:
<!-- schema.xml -->
<similarity class="com.digitalpebble.solr.PayloadSimilarity" />
So far so good. We now need a QueryParser plugin in order to use the payloads at search time and, as mentioned above, I want to keep the functionality of the DisMaxQParser.
The problem is that we need to build PayloadTermQuery objects instead of TermQueries, which happens deep down in the object hierarchy and cannot, AFAIK, be changed simply from the DisMaxQParser. I have therefore implemented a modified version of the DisMaxQParser which rewrites the main part of the query (a.k.a. userQuery in the implementation) and substitutes the TermQueries with PayloadTermQueries.
First we'll create a QParserPlugin:
package com.digitalpebble.solr;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class PLDisMaxQParserPlugin extends QParserPlugin {

    public void init(NamedList args) {
    }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        return new PLDisMaxQParser(qstr, localParams, params, req);
    }
}
which does not do much apart from exposing the PLDisMaxQParser, a modified version of the standard DisMaxQParser that builds Payload query objects.
package com.digitalpebble.solr;

import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.payloads.MaxPayloadFunction;
import org.apache.lucene.search.payloads.PayloadFunction;
import org.apache.lucene.search.payloads.PayloadNearQuery;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.solr.common.params.DisMaxParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DisMaxQParser;
import org.apache.solr.util.SolrPluginUtils;

/**
 * Modified query parser for dismax queries which uses payloads
 */
public class PLDisMaxQParser extends DisMaxQParser {

    public static final String PAYLOAD_FIELDS_PARAM_NAME = "plf";

    public PLDisMaxQParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
    }

    protected HashSet<String> payloadFields = new HashSet<String>();

    private static final PayloadFunction func = new MaxPayloadFunction();

    float tiebreaker = 0f;

    protected void addMainQuery(BooleanQuery query, SolrParams solrParams)
            throws ParseException {
        Map<String, Float> phraseFields = SolrPluginUtils
                .parseFieldBoosts(solrParams.getParams(DisMaxParams.PF));
        tiebreaker = solrParams.getFloat(DisMaxParams.TIE, 0.0f);

        // get the comma separated list of fields used for payload
        String[] plfarray = solrParams.get(PAYLOAD_FIELDS_PARAM_NAME, "")
                .split(",");
        for (String plf : plfarray)
            payloadFields.add(plf.trim());

        /*
         * a parser for dealing with user input, which will convert things to
         * DisjunctionMaxQueries
         */
        SolrPluginUtils.DisjunctionMaxQueryParser up = getParser(queryFields,
                DisMaxParams.QS, solrParams, tiebreaker);

        /* for parsing sloppy phrases using DisjunctionMaxQueries */
        SolrPluginUtils.DisjunctionMaxQueryParser pp = getParser(phraseFields,
                DisMaxParams.PS, solrParams, tiebreaker);

        /* * * Main User Query * * */
        parsedUserQuery = null;
        String userQuery = getString();
        altUserQuery = null;
        if (userQuery == null || userQuery.trim().length() < 1) {
            // If no query is specified, we may have an alternate
            altUserQuery = getAlternateUserQuery(solrParams);
            query.add(altUserQuery, BooleanClause.Occur.MUST);
        } else {
            // There is a valid query string
            userQuery = SolrPluginUtils.partialEscape(
                    SolrPluginUtils.stripUnbalancedQuotes(userQuery))
                    .toString();
            userQuery = SolrPluginUtils.stripIllegalOperators(userQuery)
                    .toString();
            parsedUserQuery = getUserQuery(userQuery, up, solrParams);
            // recursively rewrite the elements of the query
            Query payloadedUserQuery = rewriteQueriesAsPLQueries(parsedUserQuery);
            query.add(payloadedUserQuery, BooleanClause.Occur.MUST);
            Query phrase = getPhraseQuery(userQuery, pp);
            if (null != phrase) {
                query.add(phrase, BooleanClause.Occur.SHOULD);
            }
        }
    }

    /** Substitutes original query objects with payload ones **/
    private Query rewriteQueriesAsPLQueries(Query input) {
        Query output = input;
        // rewrite TermQueries
        if (input instanceof TermQuery) {
            Term term = ((TermQuery) input).getTerm();
            // check that this is done on a field that has payloads
            if (payloadFields.contains(term.field()) == false)
                return input;
            output = new PayloadTermQuery(term, func);
        }
        // rewrite PhraseQueries
        else if (input instanceof PhraseQuery) {
            PhraseQuery pin = (PhraseQuery) input;
            Term[] terms = pin.getTerms();
            int slop = pin.getSlop();
            boolean inorder = false;
            // check that this is done on a field that has payloads
            if (terms.length > 0
                    && payloadFields.contains(terms[0].field()) == false)
                return input;
            SpanQuery[] clauses = new SpanQuery[terms.length];
            // phrase queries : keep the default function i.e. average
            for (int i = 0; i < terms.length; i++)
                clauses[i] = new PayloadTermQuery(terms[i], func);
            output = new PayloadNearQuery(clauses, slop, inorder);
        }
        // recursively rewrite DisjunctionMaxQueries
        else if (input instanceof DisjunctionMaxQuery) {
            DisjunctionMaxQuery s = ((DisjunctionMaxQuery) input);
            DisjunctionMaxQuery t = new DisjunctionMaxQuery(tiebreaker);
            Iterator<Query> disjunctsiterator = s.iterator();
            while (disjunctsiterator.hasNext()) {
                Query rewrittenQuery = rewriteQueriesAsPLQueries(disjunctsiterator
                        .next());
                t.add(rewrittenQuery);
            }
            output = t;
        }
        // recursively rewrite BooleanQueries
        else if (input instanceof BooleanQuery) {
            for (BooleanClause clause : (List<BooleanClause>) ((BooleanQuery) input)
                    .clauses()) {
                Query rewrittenQuery = rewriteQueriesAsPLQueries(clause
                        .getQuery());
                clause.setQuery(rewrittenQuery);
            }
        }
        output.setBoost(input.getBoost());
        return output;
    }

    public void addDebugInfo(NamedList<Object> debugInfo) {
        super.addDebugInfo(debugInfo);
        if (this.payloadFields.size() > 0) {
            Iterator<String> iter = this.payloadFields.iterator();
            while (iter.hasNext())
                debugInfo.add("payloadField", iter.next());
        }
    }
}
Once these three classes have been compiled, jarred, and put on the SOLR classpath, we must add
<queryParser name="payload" class="com.digitalpebble.solr.PLDisMaxQParserPlugin" />
to solrconfig.xml.
Then specify for the requestHandler:
<str name="defType">payload</str>
<!-- plf : comma separated list of field names -->
<str name="plf">payloads</str>
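For context, a complete handler definition could look like the following; the handler name and the qf/tie values are illustrative placeholders, not taken from an actual setup:

<requestHandler name="payloadsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">payload</str>
    <!-- plf : comma separated list of field names -->
    <str name="plf">payloads</str>
    <!-- the usual dismax parameters still apply -->
    <str name="qf">payloads^1.0</str>
    <str name="tie">0.1</str>
  </lst>
</requestHandler>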
The fields listed in the parameter plf will be queried with Payload query objects. Remember that you can use &debugQuery=true to get the details of the scores and check that the payloads are being used.
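As a quick sanity check (the host, port, handler name, and query term below are placeholders), a request against the handler defined above would look something like:

http://localhost:8983/solr/select?qt=payloadsearch&q=apache&debugQuery=true

The debug section of the response should then show whether the payload scores are being applied.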
Thursday, 19 August 2010
Tika on FeatherCast
Apache Tika recently split off from the Lucene project and became a separate top-level Apache project. Chris Mattmann talks about what Tika is and where it is going on http://feathercast.org/?p=90
Labels: tika
Friday, 13 August 2010
Towards Nutch 2.0
Never mind the dodgy look of the blog - I'll improve that later!
For my first post, I'd like to mention the progress we've made recently towards Apache Nutch 2.0. It is based on a branch named NutchBase, which was developed mainly by Doğacan Güney and is now in the trunk of the SVN repository. One of the main aspects of Nutch 2.0 is that it now stores its data in a datastore rather than in Hadoop's file-based structures. Note that we still have the distribution and replication of the data over a whole cluster and data locality for MapReduce, but we can now also insert or modify a single entry in the table without having to read and write the whole data structure, as was the case before.
Nutch uses a project named GORA as an intermediary between our code and the backend storage. There is a lot to say about GORA, but to keep it short, what we are trying to achieve with it is a sort of common API for NoSQL stores. GORA already has implementations for HBase and Cassandra, but also SQL. The plan for GORA is to put it in the Apache Incubator or possibly make it an Apache subproject (Hadoop? HBase? Cassandra?). We'll see how it goes.
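To give a rough idea of the random access this enables, here is a sketch of what fetching and updating a single entry through GORA might look like. This is my own illustration: the package names, the DataStoreFactory/DataStore signatures, and the WebPage key format follow later Apache GORA and Nutch 2.x releases and are only an approximation of the NutchBase code.

// Rough illustration only: names and signatures are approximate, not taken from the branch.
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;

public class RandomAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the concrete backend (HBase, Cassandra, SQL) is picked via gora.properties
        DataStore<String, WebPage> store =
                DataStoreFactory.getDataStore(String.class, WebPage.class, conf);
        // fetch a single row by key, no need to scan the whole table
        WebPage page = store.get("org.example.www:http/");
        // ... modify some fields, then write the single entry back ...
        store.put("org.example.www:http/", page);
        store.flush();
        store.close();
    }
}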
There are quite a few structural changes in Nutch, most notably the fact that there aren't any segments any more, as all the information about a URL (metadata, original content, extracted text, ...) is stored in a single table. This means, for instance, no more segments to merge or metadata to move back to the crawldb. It's all in one place!
There are other substantial changes in 2.0, notably the removal of the Lucene-based indexing and search, as we now rely on SOLR. Other indexing backends might be added later. Another step towards delegating functionality to external projects is the increased use of Apache Tika for the parsing. We've removed quite a few legacy parsers from Nutch and let Tika do the work for us. We've also revamped the organisation of the code and done a lot of code clean-up.
Nutch 2.0 is still at an early stage and we are actively working on it: testing, debugging, etc. The good news is that it is not only an architectural change but also a basis for a whole lot of new functionality (see for instance https://issues.apache.org/jira/browse/NUTCH-882).
I'll keep you posted on our progress. As usual: give it a try, get involved, join us...