
Stemmer for Bulgarian based on a Trie with all the word forms

Primary language: Java. License: MIT.

BG-Stemmer

This is an experimental stemmer for Bulgarian. The two existing alternatives are Lucene's default light rule-based stemmer and Preslav Nakov's BulStem, an inflectional stemmer.

This stemmer loads all word forms into a trie up front and then, for each word, fetches the corresponding base form. It is less space-efficient than the other two, which rely on rules alone, but benchmarks show that it is significantly faster than BulStem and on par with the default Lucene stemmer.
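To illustrate the idea, here is a minimal sketch of a trie that maps word forms to base forms. It is not the project's actual code: the class name `TrieStemmerSketch`, its methods, and the choice to return unknown words unchanged are all assumptions made for the example, and it uses a plain map-based trie rather than the patricia-trie library.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of trie-based stemming; not the project's actual classes. */
public class TrieStemmerSketch {

    /** One node per character; a node that ends a known word form stores its base form. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String baseForm; // null unless a known word form ends at this node
    }

    private final Node root = new Node();

    /** Registers an inflected form together with the base form it maps to. */
    public void add(String wordForm, String baseForm) {
        Node node = root;
        for (int i = 0; i < wordForm.length(); i++) {
            node = node.children.computeIfAbsent(wordForm.charAt(i), c -> new Node());
        }
        node.baseForm = baseForm;
    }

    /** Returns the stored base form, or the input unchanged if the form is unknown (an assumption). */
    public String stem(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return word; // no known word form starts with this prefix
            }
        }
        return node.baseForm != null ? node.baseForm : word;
    }

    public static void main(String[] args) {
        TrieStemmerSketch stemmer = new TrieStemmerSketch();
        stemmer.add("книга", "книга");
        stemmer.add("книги", "книга");
        stemmer.add("книгата", "книга");
        System.out.println(stemmer.stem("книгата"));
        System.out.println(stemmer.stem("непозната"));
    }
}
```

Each lookup walks one node per character of the input, so its cost depends on the word length rather than on the number of rules, while the trie itself has to hold every word form in memory. That is the space-for-speed trade-off described above.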

The dictionary and the affixation rules are taken from OpenOffice.

Integrating with Solr

Add the jar file (taken from the latest release), together with the guava (v.22) and lib/patricia-trie jars, to the classpath, then add the following filter to your Solr configuration:

<filter class="bg.bozho.stemmer.BulgarianStemFilterFactory"/>

Integrating with ElasticSearch

TODO