Morpheme

This repository contains dictionaries of prefix and suffix meanings, and Python scripts for building a new vector model of a Russian corpus based on decomposing words into morphemes.

TODO:

Make morpheme splitting require at least one root.
How should we split "случайный", "кофейный", "филейный", "прощай", "релейный", "зазнайка", "водогрейный", "водогрейня", "попрошайка", "сотейник", "желейный", "поезжай"? Is "й" a suffix?
"гулянье", "кривлянье": is "ь" a suffix? We have not found this suffix in the dictionaries.
Find a stress (accent) dictionary; stress sometimes affects the suffix decision. For example: https://drive.google.com/file/d/0B19_r4ZqIbD5ZXU2V3ZTU3psX1U/view
Add a small epsilon (probability of splitting into morphemes).
Try a Markov model: p->r->s->e ...
The parsing accuracy (91%) can be improved by adding more data.
The part of the paper that describes the morpheme-splitting algorithm could be a separate article.
A parametric model would be better (formula).
We can train word2vec on morphemes instead of words.
Test the model on more tasks
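
One possible reading of the Markov model and epsilon items above is a first-order chain over morpheme types with additive smoothing. The sketch below is only an illustration under that assumption, not the repository's code: the tags p, r, s, e are taken to mean prefix, root, suffix and ending, and the names train_transitions and sequence_probability are hypothetical.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # artificial boundary states (assumption)

def train_transitions(tag_sequences, tags, epsilon=1e-3):
    """Estimate P(next type | previous type) with additive smoothing."""
    counts = defaultdict(Counter)
    for seq in tag_sequences:
        path = [START, *seq, END]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    targets = [*tags, END]
    probs = {}
    for prev in [START, *tags]:
        # epsilon keeps unseen transitions from getting probability zero
        total = sum(counts[prev].values()) + epsilon * len(targets)
        probs[prev] = {t: (counts[prev][t] + epsilon) / total for t in targets}
    return probs

def sequence_probability(seq, probs):
    """Probability of a whole morpheme-type sequence under the chain."""
    path = [START, *seq, END]
    p = 1.0
    for prev, nxt in zip(path, path[1:]):
        p *= probs[prev][nxt]
    return p

# Toy training data, invented for illustration only.
tags = ["p", "r", "s", "e"]
training = [["p", "r", "s", "e"], ["r", "s", "e"], ["r", "e"]]
probs = train_transitions(training, tags)
likely = sequence_probability(["p", "r", "s", "e"], probs)
unlikely = sequence_probability(["e", "s", "r", "p"], probs)
```

With such a model, competing splits of the same word could be ranked by the probability of their type sequences; whether this helps the open questions above is untested.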

Remark: the w2v model does not contain these words, but our model does: недокомпьютеризация сверхмодерновый загипнотизированный подвыпивший безнравственная уработался заработался сосисочки венички заделаться моднявый

Folders:

dicts - our vocabularies.

suffixes.txt - vocabulary of the suffixes with meanings and examples.
prefixes.txt - vocabulary of the prefixes with meanings and examples.
roots.txt - vocabulary of the roots with meanings and examples.

The format of every line in these vocabularies is:

morpheme type - morpheme; morpheme type - morpheme; ....

For example:

goodness: root - good; suffix - ness.
all_words_like_morphemes.txt - vocabulary of word segmentations.
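
A minimal sketch of reading one line in the format shown above: a word, a colon, then "type - morpheme" pairs separated by semicolons. The function name parse_segmentation is hypothetical, and the assumption that a line may end with a period comes only from the single example given here.

```python
def parse_segmentation(line):
    """Parse 'word: type - morpheme; type - morpheme.' into
    (word, [(type, morpheme), ...])."""
    word, _, rest = line.strip().partition(":")
    rest = rest.strip().rstrip(".")  # drop the trailing period, if any
    morphemes = []
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        mtype, sep, morpheme = chunk.partition(" - ")
        if not sep:
            raise ValueError(f"expected 'type - morpheme', got {chunk!r}")
        morphemes.append((mtype.strip(), morpheme.strip()))
    return word.strip(), morphemes

word, parts = parse_segmentation("goodness: root - good; suffix - ness.")
```

For the example line, word is "goodness" and parts is a list of (type, morpheme) pairs in the order they appear in the word.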
