This project depends on the Flax lucene-solr-intervals fork of Lucene/Solr. Before building, download lucene-solr-intervals from https://github.com/flaxsearch/lucene-solr-intervals and follow its Maven build instructions.
Once that's done, you can build and deploy Luwak by running mvn install.
You can find out more about how Flax use Luwak for media monitoring applications in this video from Lucene Revolution 2013: http://www.youtube.com/watch?v=rmRCsrJp2A8
A small demo program is included in the distribution. It runs queries provided in a text file over a small corpus of documents from Project Gutenberg (via NLTK):
    ./run-demo
Basic usage looks like this:
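    // textfield, document and WHITESPACE are assumed to be defined elsewhere
    // (a field name, the document text and an Analyzer, respectively).
    Monitor monitor = new Monitor(new TermFilteredPresearcher());

    // Register a query with the monitor under the id "query1".
    MonitorQuery mq = new MonitorQuery("query1", new TermQuery(new Term(textfield, "test")));
    monitor.update(mq);

    // Build a document and match it against all registered queries.
    InputDocument doc = InputDocument.builder("doc1")
        .addField(textfield, document, WHITESPACE)
        .build();
    DocumentMatches matches = monitor.match(doc);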
The monitor uses a Presearcher implementation to reduce the number of queries it runs during a match run. Luwak comes with three presearcher implementations.
The first presearcher does no filtering whatsoever, so the monitor will run all its registered queries against every document passed to match.
TermFilteredPresearcher extracts terms from each registered query and indexes the queries against them in the Monitor's internal index. At match time, the passed-in InputDocument is tokenized and converted to a disjunction query. All queries that match this query in the monitor's index are then run against the document.
Only whole terms are extracted from the InputDocument, so any queries that use fuzzy or partial matching, such as RegexpQueries, are stored using a special AnyToken that matches all documents.
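
The filtering idea described above can be modelled in a few lines of plain Java. This is only an illustrative sketch under stated assumptions, not Luwak's implementation: the class TermFilterSketch, its methods and the ANY_TOKEN value are all hypothetical, queries are represented by their ids and pre-extracted terms, and a simple map stands in for the Monitor's internal index.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only: these are not Luwak's classes or API.
class TermFilterSketch {
    // Hypothetical stand-in for the special token that matches all documents.
    static final String ANY_TOKEN = "__ANYTOKEN__";

    // term -> ids of the queries indexed under that term
    private final Map<String, Set<String>> index = new HashMap<>();

    // Index a query under the whole terms extracted from it. An empty list
    // models a fuzzy or regexp query from which no whole term can be extracted.
    void indexQuery(String queryId, List<String> extractedTerms) {
        List<String> keys = extractedTerms.isEmpty() ? List.of(ANY_TOKEN) : extractedTerms;
        for (String term : keys) {
            index.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(queryId);
        }
    }

    // Tokenize the document on whitespace and return every query that shares
    // at least one term with it (a disjunction), plus all ANY_TOKEN queries.
    Set<String> candidates(String documentText) {
        Set<String> result = new LinkedHashSet<>(index.getOrDefault(ANY_TOKEN, Set.of()));
        for (String token : documentText.toLowerCase().split("\\s+")) {
            result.addAll(index.getOrDefault(token, Set.of()));
        }
        return result;
    }
}
```

Only the queries returned by candidates would then be run in full against the document. Queries filed under the any-token are always candidates, which is why a query set dominated by fuzzy or regexp queries gains little from this kind of filtering.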
The third is a specialization of TermFilteredPresearcher that also extracts ngrams from InputDocuments and matches them against exact substrings of fuzzy terms. This presearcher trades longer document preparation times for more exact query filtering. Whether it is more appropriate than TermFilteredPresearcher will depend on the queries and documents being used.
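
To make the ngram trade-off concrete, here is a rough sketch in plain Java. It is a simplification under stated assumptions rather than Luwak's actual strategy: the class NGramSketch and its methods are hypothetical, patterns use only the wildcards '*' and '?', and each pattern is filtered on its longest exact substring.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative model only: these are not Luwak's classes or API.
class NGramSketch {
    // Every contiguous substring of a token. The quadratic growth here is
    // the "longer document preparation time" side of the trade-off.
    static Set<String> ngrams(String token) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < token.length(); start++) {
            for (int end = start + 1; end <= token.length(); end++) {
                grams.add(token.substring(start, end));
            }
        }
        return grams;
    }

    // Longest run of literal characters in a pattern using '*' and '?'.
    static String longestExactSubstring(String pattern) {
        String best = "";
        for (String part : pattern.split("[*?]+")) {
            if (part.length() > best.length()) {
                best = part;
            }
        }
        return best;
    }

    // True if the pattern is worth running against the document: some
    // document ngram equals the pattern's longest exact substring.
    static boolean mayMatch(String pattern, String documentText) {
        String key = longestExactSubstring(pattern);
        if (key.isEmpty()) {
            return true; // nothing exact to filter on, so always a candidate
        }
        for (String token : documentText.split("\\s+")) {
            if (ngrams(token).contains(key)) {
                return true;
            }
        }
        return false;
    }
}
```

With this model, a pattern such as mon*or is only a candidate for documents containing a token with the substring mon, instead of being run against every document.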
TermFilteredPresearcher uses a set of Extractor<T extends Query> objects to extract terms from registered queries for indexing. If a passed-in query does not have a specialised Extractor, the presearcher will fall back to using a GenericTermExtractor, which just uses Query#extractTerms(Set).
This will not be appropriate for all custom Query types. You can create your own custom extractor by subclassing Extractor and passing an instance to the TermFilteredPresearcher constructor.
    public class CustomQueryExtractor extends Extractor<CustomQuery> {

        public CustomQueryExtractor() {
            super(CustomQuery.class);
        }

        @Override
        public void extract(CustomQuery query, List<QueryTerm> terms,
                            List<Extractor<?>> extractors) {
            terms.add(getYourTermsFromCustomQuery(query));
        }
    }

    Presearcher presearcher = new TermFilteredPresearcher(new CustomQueryExtractor());
You can implement your own query filtering code by subclassing Presearcher. You will need to implement buildQuery(InputDocument), which converts incoming documents into queries to be run against the Monitor's query index, and indexQuery(Query), which converts registered queries into a form that can be indexed.
Note that indexQuery(Query) must not create fields named 'id' or 'del_id', as these are reserved by the Monitor's internal index.
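
One way to avoid clashing with the reserved names is to route every field name through a small guard before using it. The sketch below is a hypothetical helper in plain Java, not part of Luwak; the class ReservedFields and its method are assumptions made for illustration.

```java
import java.util.Set;

// Hypothetical helper, not part of Luwak.
class ReservedFields {
    // Field names reserved by the Monitor's internal index.
    static final Set<String> RESERVED = Set.of("id", "del_id");

    // Returns the name unchanged, or throws if it is reserved.
    static String checkFieldName(String name) {
        if (RESERVED.contains(name)) {
            throw new IllegalArgumentException(
                "Field name '" + name + "' is reserved by the Monitor's internal index");
        }
        return name;
    }
}
```

A custom indexQuery implementation could call ReservedFields.checkFieldName on each field name it is about to create, so that a clash fails early with a clear message.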