This repository contains dictionaries of prefix and suffix meanings, and Python scripts for building a new vector model of a Russian corpus based on decomposing words into morphemes.
TODO:
- Make morpheme splitting require at least one root
- How should we split "случайный", "кофейный", "филейный", "прощай", "релейный", "зазнайка", "водогрейный", "водогрейня", "попрошайка", "сотейник", "желейный", "поезжай"? Is "й" a suffix?
- "гулянье", "кривлянье": is "ь" a suffix? We have not found this suffix in the dictionaries
- We need to find a stress (accent) dictionary; stress sometimes affects the suffix decision. For example: https://drive.google.com/file/d/0B19_r4ZqIbD5ZXU2V3ZTU3psX1U/view
- Add a small epsilon (probability of splitting into morphemes)
- Try a Markov model p->r->s->e ...
- The parsing accuracy (91%) can be improved by adding more data
- The part of the paper describing the morpheme-splitting algorithm could be a separate article!
- A parametric model would be better (formula)
- We can train word2vec on morphemes instead of words
- Test the model on more tasks
Remark: the w2v model does not contain these words, but our model does: недокомпьютеризация, сверхмодерновый, загипнотизированный, подвыпивший, безнравственная, уработался, заработался, сосисочки, венички, заделаться, моднявый
dicts - our vocabularies.
- suffixes.txt - vocabulary of the suffixes with meanings and examples.
- prefixes.txt - vocabulary of the prefixes with meanings and examples.
- roots.txt - vocabulary of the roots with meanings and examples.
Every line in these vocabularies has the format:
word: morpheme type - morpheme; morpheme type - morpheme; ....
For example:
goodness: root - good; suffix - ness.
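
The line format above can be read with a few lines of Python. This is only a minimal sketch, not one of the repository's scripts: the function name `parse_segmentation` is hypothetical, and it assumes every line looks like the example (a word, a colon, then `type - morpheme` pairs separated by `;`, with an optional trailing period).

```python
from typing import List, Tuple


def parse_segmentation(line: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Parse 'word: type - morpheme; type - morpheme.' into (word, pairs).

    Hypothetical helper: the format is assumed from the README example only.
    """
    word, colon, rest = line.strip().partition(":")
    if not colon:
        raise ValueError(f"no 'word:' prefix in line: {line!r}")
    # Drop the optional trailing period(s) before splitting into entries.
    rest = rest.strip().rstrip(".")
    pairs = []
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Each entry is '<morpheme type> - <morpheme>'.
        kind, sep, morpheme = chunk.partition(" - ")
        if not sep:
            raise ValueError(f"malformed entry: {chunk!r}")
        pairs.append((kind.strip(), morpheme.strip()))
    return word.strip(), pairs


word, pairs = parse_segmentation("goodness: root - good; suffix - ness.")
print(word, pairs)
# A segmentation can then be checked for at least one root:
print(any(kind == "root" for kind, _ in pairs))
```

The parser raises `ValueError` on lines it cannot interpret rather than guessing, so entries that deviate from the assumed format are easy to spot.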
- all_words_like_morphemes.txt - vocabulary of word segmentations.