BulStem-py: A Python Re-implementation of BulStem - inflectional stemmer for Bulgarian

Introduction

This is a Python version of the BulStem stemming algorithm. It follows the algorithm presented in:

Nakov, P. BulStem: Design and evaluation of inflectional stemmer for Bulgarian. In Workshop on 
Balkan Language Resources and Tools (Balkan Conference in Informatics).

See http://people.ischool.berkeley.edu/~nakov/bulstem/ for the homepage of the algorithm. Also, check the original paper for more details and examples.

Implementation

Unlike the other available implementations, this one uses a trie rather than a dictionary/hash table to find the longest rule that can be applied to a token.

Basic algorithm steps:

1. Find the position of the first vowel in the token.
2. Find the longest applicable rule by traversing the token in reverse order until a suffix matches, going no further back than the position of the first vowel (found in step 1).
3. Prepend the unstemmed prefix to the stemmed suffix from step 2.
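
The steps above can be sketched with a small suffix trie. This is a minimal illustration, not the library's actual code: the names SuffixTrie, first_vowel and stem are hypothetical, the vowel set is an assumption, and the choice to let a suffix start only strictly after the first vowel is one possible reading of step 2. The min_freq and left_context options are not modelled here.

```python
VOWELS = set("аъоуеияю")  # assumed set of Bulgarian vowels


class SuffixTrie:
    """Trie keyed on reversed suffixes; a node holds a replacement where a rule ends."""

    def __init__(self):
        self.children = {}
        self.replacement = None  # None means "no rule ends at this node"

    def add(self, suffix, replacement):
        node = self
        for ch in reversed(suffix):
            node = node.children.setdefault(ch, SuffixTrie())
        node.replacement = replacement

    def longest_match(self, token, stop):
        """Return (start, replacement) for the longest rule suffix that
        starts at index >= stop, or None if no rule matches."""
        node, best = self, None
        for i in range(len(token) - 1, stop - 1, -1):
            node = node.children.get(token[i])
            if node is None:
                break
            if node.replacement is not None:
                best = (i, node.replacement)  # longer matches overwrite shorter ones
        return best


def first_vowel(token):
    """Step 1: index of the first vowel, or -1 if the token has none."""
    for i, ch in enumerate(token):
        if ch in VOWELS:
            return i
    return -1


def stem(token, trie):
    pos = first_vowel(token)
    if pos < 0:
        return token
    # Step 2: assumption, the matched suffix must start after the first vowel.
    match = trie.longest_match(token, pos + 1)
    if match is None:
        return token
    start, replacement = match
    # Step 3: unstemmed prefix + stemmed suffix.
    return token[:start] + replacement


trie = SuffixTrie()
trie.add("ой", "о")
trie.add("й", "")  # hypothetical shorter rule, to show that the longest match wins
print(stem("порой", trie))  # поро
```

Because the trie is walked from the last character backwards, a single pass over the token finds every matching suffix, and the last one recorded is the longest.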

Installation

This library is compatible with Python >= 3.6.

Clone the repository and run:

With pip

pip install -e .
pip install -r requirements.txt

Test

A set of tests is included in the project, under the tests folder. The test suite can be run as follows:

pip install -e ".[testing]"
pip install -r requirements-test.txt
python -m unittest

Usage

The library works with a set of rules used for stemming. The rules can be either passed as a list to the BulStemmer constructor, or as a path to a file.

For both options the rules need to be formatted as follows:

word ==> stem ==> freq

A pre-defined set of rules is included in the package and can be used directly (see the examples under "Reading the rules from an external file").

Manually loading rules

from bulstem.stem import BulStemmer

stemmer = BulStemmer(["ой ==> о 10"], min_freq=0, left_context=2)
stemmer.stem('порой')  # Expected output: 'поро'

BulStemmer constructor params:

rules - Iterable of strings containing rules.
min_freq - The minimum frequency of a rule to be used when stemming.
left_context - Size of the prefix which will not be stemmed.
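
To make the role of min_freq concrete, here is a rough sketch of how rule strings shaped like the one in the example above ("ой ==> о 10") could be parsed and filtered. It is not the library's parser: RULE_RE and load_rules are hypothetical names, the accepted line shape is inferred from that single example, and treating min_freq as an inclusive bound is an assumption.

```python
import re

# Assumed line shape, inferred from the example rule "ой ==> о 10".
RULE_RE = re.compile(r"^\s*(\S+)\s+==>\s+(\S+)\s+(\d+)\s*$")


def load_rules(lines, min_freq=0):
    """Map suffix -> stem for every well-formed rule seen at least min_freq times."""
    rules = {}
    for line in lines:
        m = RULE_RE.match(line)
        if m is None:
            continue  # skip lines that do not look like a rule
        suffix, stem, freq = m.group(1), m.group(2), int(m.group(3))
        if freq >= min_freq:  # assumption: the bound is inclusive
            rules[suffix] = stem
    return rules


print(load_rules(["ой ==> о 10", "ия ==> и 1"], min_freq=2))  # {'ой': 'о'}
```

Raising min_freq drops rarely observed rules, which trades coverage for fewer spurious stems.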

Reading the rules from an external file

from bulstem.stem import BulStemmer


# Pre-defined names of rule sets
PRE_DEFINED_RULES = ['stem-context-1', 
                     'stem-context-2',
                     'stem-context-3']
                     
# Expected output:
# 1 втор
# 2 втори
# 3 вторият
for i, rules_name in enumerate(PRE_DEFINED_RULES, start=1):
    stemmer = BulStemmer.from_file(rules_name, min_freq=2, left_context=i)
    print(i, stemmer.stem('вторият'))

# Load rules from an explicit file path; left_context is set to match the context-2 rule file
stemmer = BulStemmer.from_file('stem_rules_context_2_utf8.txt', min_freq=2, left_context=2)
stemmer.stem('вторият')  # Expected output: 'втори'
stemmer.stem('вероятен')  # Expected output: 'вероят'

BulStemmer.from_file params:

path - Path (or pre-defined name) to the rules file formatted as follows: word ==> stem ==> freq.
min_freq - The minimum frequency of a rule to be used when stemming.
left_context - Size of the prefix which will not be stemmed.

Other implementations

Perl (Original), Java (JDK 1.4), Ruby, C#, Python2, GATE plugin (Java)

License

For license information, see LICENSE.

mhardalov/bulstem-py