lemmagen3 is a Python 2/3 wrapper for the Lemmagen lemmatizer (version 2.2).
It differs from other Lemmagen wrappers, such as the one available on PyPI, in that it offers a clean, fast object-oriented interface built with the pybind11 library and supports an additional language (Croatian).
Models for Slovene, Croatian and Serbian have been significantly updated and use frequency data to prefer the most frequent lemma, e.g., for Slovene: je->biti instead of je->jesti, mene->jaz instead of mene->mena, od->od instead of od->oda, etc.
In total, 19 languages are supported:
- Bulgarian: bg
- Croatian: hr
- Czech: cs
- English: en
- Estonian: et
- Farsi/Persian: fa
- French: fr
- German: de
- Hungarian: hu
- Italian: it
- Macedonian: mk
- Polish: pl
- Romanian: ro
- Russian: ru
- Serbian: sr
- Slovak: sk
- Slovene: sl
- Spanish: es
- Ukrainian: uk
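To validate a language code before constructing a lemmatizer, the list above can be written as a plain dictionary. This is only a sketch: `SUPPORTED_LANGUAGES` and `language_name` are hypothetical helper names, not part of lemmagen3, and the authoritative list is whatever `Lemmatizer.list_supported_languages()` returns.

```python
# Hypothetical helper, not part of lemmagen3.
# Codes are copied from the list of supported languages above.
SUPPORTED_LANGUAGES = {
    'bg': 'Bulgarian', 'hr': 'Croatian', 'cs': 'Czech', 'en': 'English',
    'et': 'Estonian', 'fa': 'Farsi/Persian', 'fr': 'French', 'de': 'German',
    'hu': 'Hungarian', 'it': 'Italian', 'mk': 'Macedonian', 'pl': 'Polish',
    'ro': 'Romanian', 'ru': 'Russian', 'sr': 'Serbian', 'sk': 'Slovak',
    'sl': 'Slovene', 'es': 'Spanish', 'uk': 'Ukrainian',
}


def language_name(code):
    """Return the language name for a code, or raise ValueError if unsupported."""
    try:
        return SUPPORTED_LANGUAGES[code.lower()]
    except KeyError:
        raise ValueError('unsupported language code: %r' % code)
```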
Running pip install lemmagen3 installs the module and the language model files. Note that on Python <=3.5 and Python 2.7 the package is built from source, so you will need a C++ compiler.
Note: If you use Python 3.5.0 or 3.5.1, you will likely get the error shown below. This is a known bug in these two versions, so please consider upgrading your Python.
ImportError: ..._lemmagen.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _PyThreadState_UncheckedGet
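If you want to detect the affected interpreters before importing the module, a check along these lines may help. It is a minimal sketch: `AFFECTED_VERSIONS` and `is_affected` are hypothetical names introduced here, and the version list simply mirrors the two releases named above.

```python
import sys

# The two releases named above as triggering the
# _PyThreadState_UncheckedGet import error.
AFFECTED_VERSIONS = {(3, 5, 0), (3, 5, 1)}


def is_affected(version_info=None):
    """Return True if the given (or running) interpreter is 3.5.0 or 3.5.1."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) in AFFECTED_VERSIONS


if is_affected():
    sys.stderr.write('Python 3.5.0/3.5.1 detected: importing lemmagen3 '
                     'is likely to fail, please upgrade.\n')
```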
The following snippet illustrates how to use lemmagen3:
from lemmagen3 import Lemmatizer
# first, we can list all supported languages
print(Lemmatizer.list_supported_languages())
# then, create a few lemmatizer objects using ISO 639-1 language codes
# (English, Slovene and Russian)
lem_en = Lemmatizer('en')
lem_sl = Lemmatizer('sl')
lem_ru = Lemmatizer('ru')
# now lemmatize the word "cats" in all three languages
print(lem_en.lemmatize('cats'))
print(lem_sl.lemmatize('mačke'))
print(lem_ru.lemmatize('коты'))
# you can also change the language for an existing Lemmatizer object
# lem_en will now become a French lemmatizer:
lem_en.load_language('fr')
# finally, you can also load your own Lemmagen model
my_lem = Lemmatizer()
my_lem.load_model('/path/to/my/model')
Note that the lemmatize method accepts a single string token and does not split the input string! To lemmatize a chunk of text, you have to tokenize it first, e.g.:
sentence = 'cats hate dogs'
tokens = sentence.split()
sentence_lemmatized = ' '.join([lem_en.lemmatize(token) for token in tokens])
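Splitting on whitespace leaves punctuation attached to words (for example "dogs."), so real text usually needs a slightly stricter tokenizer. The sketch below uses a simple regular expression for this; `TOKEN_RE` and `lemmatize_text` are hypothetical names, not part of lemmagen3, and the function takes the lemmatizing callable as an argument so it can be tried without the package installed.

```python
import re

# A token is either a run of word characters or a single
# non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r'(\w+)|([^\w\s])', re.UNICODE)


def lemmatize_text(text, lemmatize):
    """Tokenize `text` and return a list of lemmas.

    Word tokens are passed to `lemmatize` one at a time;
    punctuation tokens are kept unchanged.
    """
    lemmas = []
    for match in TOKEN_RE.finditer(text):
        word, punct = match.groups()
        lemmas.append(lemmatize(word) if word else punct)
    return lemmas
```

With a lemmatizer object from the snippet above, the bound method can be passed directly, e.g. `lemmatize_text('cats hate dogs.', lem_en.lemmatize)`.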
Note also that lemmagen3 operates on Unicode strings, so if you use Python 2 make sure that your input is a unicode object rather than a byte string.
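One way to guard against byte strings on both Python 2 and Python 3 is to decode them before lemmatizing. This is a minimal sketch: `to_text` is a hypothetical helper, not part of lemmagen3, and it assumes the input bytes are UTF-8 unless another encoding is given.

```python
def to_text(value, encoding='utf-8'):
    """Return `value` as Unicode text.

    Byte strings (`str` on Python 2, `bytes` on Python 3) are decoded
    using `encoding` (UTF-8 is an assumption); text is returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

The result can then be passed on as usual, e.g. `lem_sl.lemmatize(to_text(raw_token))`, where `raw_token` stands for whatever input you have.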
Please note that this repository contains code and binary models compiled and built from different sources which are under different licenses:
- C++ files and headers come from Lemmagen and are modified and adapted to work as a Python module (LGPL)
- Binary models are built from Multext and Multext-east sources:
  - Language resources used to build the Farsi/Persian, Macedonian, Polish, and Russian models are for non-commercial use only.
  - Language resources for the other supported languages are released under CC BY-SA 4.0.
The rest of the code in this repository was created by the author and is licensed under the MIT license.
- lemmagen3 is developed by Vid Podpečan (vid.podpecan@ijs.si).
- The Lemmagen lemmatizer was developed by Matjaž Juršič.