The Voikko Analysis plugin provides Finnish language analysis using Voikko.
Plugin version | Elasticsearch version |
---|---|
0.6.0 | 7.3.2 |
0.5.0 | 5.1.1 |
0.4.0 | 2.2.1 |
0.3.0 | 1.5.2 |
If you are not installing the latest version, follow the links in the table to see installation instructions for the old version.
The plugin needs the libvoikko shared library to work. The details of installing the library vary
by operating system. On Debian-based systems, apt-get install libvoikko1 should work.
Next, you'll need to download the morpho dictionary (for libvoikko version 4.0+, use morpho dict v5
instead). Unzip it into Voikko's dictionary directory (e.g. /usr/lib/voikko on Debian) or into a
directory you specify with the dictionaryPath configuration property.
Finally, to install the plugin, run:
bin/elasticsearch-plugin install https://github.com/EvidentSolutions/elasticsearch-analysis-voikko/releases/download/v0.6.0/elasticsearch-analysis-voikko-0.6.0.zip
Elasticsearch ships with a fairly restrictive security policy. Plugins can specify the permissions
that they need in plugin-security.policy. However, elasticsearch-analysis-voikko uses the
JNA library, which is already distributed with Elasticsearch and therefore can't be included in the
plugin zip. This means that the security policy bundled with the plugin will not apply to JNA, yet
JNA must be able to load libvoikko from the system.
Therefore, you need to create a custom security policy that grants Elasticsearch itself the permission
to load libvoikko:
grant {
  permission java.io.FilePermission "<<ALL FILES>>", "read";
  permission java.lang.reflect.ReflectPermission "newProxyInPackage.org.puimula.libvoikko";
};
(You don't really need to grant read access to <<ALL FILES>>; you can pass the location
of libvoikko instead.)
Save this as custom-elasticsearch.policy and tell Elasticsearch to load it:
export ES_JAVA_OPTS=-Djava.security.policy=file:/path/to/custom-elasticsearch.policy
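As a minimal sketch of the narrower alternative mentioned above, the policy file could be generated with read access limited to the directory that holds libvoikko. The directory used here (/usr/lib/x86_64-linux-gnu) is only an assumption, as are the LIBVOIKKO_DIR and POLICY_FILE variable names; check where the library actually lives on your system, and note that your setup may need further permissions.

```shell
# Assumption: libvoikko lives in this directory; adjust for your system.
LIBVOIKKO_DIR="${LIBVOIKKO_DIR:-/usr/lib/x86_64-linux-gnu}"
POLICY_FILE="${POLICY_FILE:-custom-elasticsearch.policy}"

# In a FilePermission path, a trailing "/*" matches every file directly
# inside the directory (it does not recurse into subdirectories).
cat > "$POLICY_FILE" <<EOF
grant {
  permission java.io.FilePermission "${LIBVOIKKO_DIR}/*", "read";
  permission java.lang.reflect.ReflectPermission "newProxyInPackage.org.puimula.libvoikko";
};
EOF
```

The resulting file is passed to Elasticsearch through ES_JAVA_OPTS in the same way as the broader policy.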
After installing the plugin, you can quickly verify that it works by executing:
curl -XGET 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "tokenizer" : "finnish",
  "filter" : [{"type": "voikko", "libraryPath": "/directory/of/libvoikko", "dictionaryPath": "/directory/of/voikko/dictionaries"}],
  "text" : "Testataan voikon analyysiä tällä tavalla yksinkertaisesti."
}'
If this works without error messages, you can proceed to configure the plugin for your index.
Include the finnish tokenizer and the voikko filter in your analyzer, for example:
{
  "index": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "finnish",
          "filter": ["lowercase", "voikkoFilter"]
        }
      },
      "filter": {
        "voikkoFilter": {
          "type": "voikko"
        }
      }
    }
  }
}
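One way to apply these settings is to pass them when creating an index. The following is a sketch only: the index name voikko_test, the file name voikko-settings.json and the ES_URL variable are hypothetical, and it assumes Elasticsearch is listening on localhost:9200 with the plugin installed.

```shell
# Hypothetical host and index name; adjust for your cluster.
ES_URL="${ES_URL:-localhost:9200}"
INDEX="${INDEX:-voikko_test}"

# The analysis settings from the example, wrapped in a "settings" object
# as expected by the create-index request.
cat > voikko-settings.json <<'EOF'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "tokenizer": "finnish",
            "filter": ["lowercase", "voikkoFilter"]
          }
        },
        "filter": {
          "voikkoFilter": {
            "type": "voikko"
          }
        }
      }
    }
  }
}
EOF

# Create the index with these settings.
curl -XPUT "$ES_URL/$INDEX" -H 'Content-Type: application/json' \
  --data-binary @voikko-settings.json \
  || echo "Request failed; is Elasticsearch running at $ES_URL?"

# Analyze a sample sentence with the index's default analyzer.
curl -XGET "$ES_URL/$INDEX/_analyze" -H 'Content-Type: application/json' -d '
{
  "text": "Testataan voikon analyysiä tällä tavalla yksinkertaisesti."
}' || echo "Request failed; is Elasticsearch running at $ES_URL?"
```

Because the analyzer is named default, the second request does not need to name an analyzer explicitly.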
You can use the following filter options to customize the behaviour of the filter:
Parameter | Default value | Description |
---|---|---|
language | fi_FI | Language to use |
dictionaryPath | system dependent | Path to Voikko dictionaries |
analyzeAll | false | Use all analysis possibilities or just the first |
minimumWordSize | 3 | Minimum length of words to analyze |
maximumWordSize | 100 | Maximum length of words to analyze |
libraryPath | system dependent | Path to the directory containing libvoikko |
poolMaxSize | 10 | Maximum number of Voikko instances to pool |
analysisCacheSize | 1024 | Number of analysis results to cache |
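To illustrate how these options are passed, here is a hedged sketch that overrides a few of them in an inline filter definition and sends it to the _analyze endpoint. The option values are arbitrary examples rather than recommendations, the file name voikko-analyze.json and the ES_URL variable are hypothetical, and a running Elasticsearch with the plugin installed is assumed.

```shell
# Hypothetical host; adjust for your cluster.
ES_URL="${ES_URL:-localhost:9200}"

# Inline filter definition overriding some defaults from the table:
# return all analyses, analyze shorter words, and use a larger cache.
cat > voikko-analyze.json <<'EOF'
{
  "tokenizer": "finnish",
  "filter": [
    {
      "type": "voikko",
      "analyzeAll": true,
      "minimumWordSize": 2,
      "analysisCacheSize": 4096
    }
  ],
  "text": "Testataan voikon analyysiä tällä tavalla yksinkertaisesti."
}
EOF

curl -XGET "$ES_URL/_analyze" -H 'Content-Type: application/json' \
  --data-binary @voikko-analyze.json \
  || echo "Request failed; is Elasticsearch running at $ES_URL?"
```

The same keys can be placed under a named filter such as voikkoFilter in the index settings.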
To run the tests, you need to specify the voikko.home system property, which should point to
a directory containing the libvoikko shared library and a subdirectory dicts that contains
the morpho dictionary.
This library is released under the Apache License, Version 2.0.