A Lucene custom analyzer that allow for indexing multiple tokens for a single term it supports storing terms with Mixed-case letters and terms with punctuation. This in theory should produce more accurate results as it causes Lucene to perform a more exact search.
Released under the MIT License.
$ git clone https://github.com/eXist-db/exist.git
$ cd exist
$ git checkout master
$ mvn -DskipTests package
we will refer to the exist-db directory as $EXIST_HOME
you can set it using
Linux/macOS:
$ export EXIST_HOME=/your/path/to/eXist-db
Windows:
$ set EXIST_HOME=C:\your\path\to\eXist-db
Linux/macOS:
$ cp PLMultiAnalyzer-1.0.0-SNAPSHOT.jar $EXIST_HOME/exist-distribution/target/exist-distribution-[version]-dir/lib
Windows:
$ copy PLMultiAnalyzer-1.0.0-SNAPSHOT.jar %EXIST_HOME%\exist-distribution\target\exist-distribution-[version]-dir\lib
in your $EXIST_HOME/exist-distribution/target/exist-distribution-[version]-dir/etc/startup.xml
add to the dependencies
<dependencies>
... <!-- other dependencies -->
<dependency>
<groupId>com.evolvedbinary.lucene.analyzer</groupId>
<artifactId>ohAnalyzer</artifactId>
<version>1.0.0-SNAPSHOT</version>
<relativePath>PLMultiAnalyzer-1.0.0-SNAPSHOT.jar</relativePath> <!-- must be exact match to the jar in lib folder -->
</dependency>
... <!-- other dependencies -->
<dependencies>
run the start up script Linux/macOS:
$ $EXIST_HOME/exist-distribution/target/exist-distribution-[version]-dir/bin/startup.sh
Windows:
$ %EXIST_HOME%\exist-distribution\target\exist-distribution-[version]-dir\bin\startup.bat
when creating the index config specify the Analyzer
as com.evolvedbinary.lucene.analyzer.OhAnalyzer
the Analyzer
needs two parameters
minimumTermLength
: the minimum length of any decomposed term, any smaller decomposed terms will be discarded. Set to 0 to indicate no minimum.punctuationDictionary
: the dictionary of punctuation to use for decomposition.
<collection xmlns="http://exist-db.org/collection-config/1.0">
<index xmlns:wiki="http://exist-db.org/xquery/wiki" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:atom="http://www.w3.org/2005/Atom">
<!-- Lucene index is configured below -->
<lucene>
<analyzer class="com.evolvedbinary.lucene.analyzer.OhAnalyzer">
<param name="punctuationDictionary" type="char[]">
<value>'</value>
<value>-</value>
<value>’</value>
</param>
<param name="minimumTermLength" type="int" value="2" />
<analyzer>
<text qname="doc"/>
</lucene>
</index>
</collection>