text-morpher
Features and goals
Basically this module was designed to serve a single purpose: generating morphologically correct sentences in russian to use them in a chatbot. So the main feature of this module sounds like:
Generate me a sentence of a given structure for a given topic
It looks very simple on the first sight, but when it comes to the russian morphology things get complicated, since you have to put your words in proper genders, forms, case etc. So, over the time it started to get more and more features and here is what it can do at the moment:
- Detect part of speech and other grammar tags of any word
- Convert words in text into inline templates with grammar metadata
- Put a word to any chosen form (and get it's initial form from any other)
- Use Word2Vec models to search words close by meaning
- Find matches to any word by:
- Syllable length
- Vowels structure: папа -> мама
- Accent syllable: долото -> паразит
- Grammar properties
- Regexp (match first/last letter of the orginal word etc.)
- And meaning (!): яблоко -> зеленое, круглое
These features may be use to (just examples out of my head):
- Make all words in the text start with the sale letter
- Replace all adjectives with close (or contrary) ones by meaning
- Update poetry by replacing words matching by syllables and accent
- Lookup which adjectives is related to a chosen noun
- This could be an endless list :)
How it works
Here is a simple example to give you an idea of how Morhper is processing text. Let's pick this sting as input:
"мама мыла раму"
It converts each word into a tag, containing all morphological properties of the word, and returns a string that looks somewhat like:
{{NOUN/.*/femn,sing,nomn/мама}}
{{VERB/.*/impf,tran,femn,sing,past,indc/мыла}}
{{NOUN/.*/femn,sing,accs/раму}}
The structure of each tag is quite simple:
- Part of speech
- Regexp to filter matches
- Set of grammar tags (gender, tense, plural, case etc.)
- Original word
Then using function Morpher.runTemplate we can process this string with tags and find matches for each word. Searching for matches could be configured with parameters provided as a second argument to this function. By default set of parameters is taken from ./morpher.config.json file in the root folder.
So if we try to run this template without using Word2Vec model - we get morphologically correct, but quite meaningless replacements, like:
"француженка клеила миску"
"страница брала бороду"
"харчевня пенила клятву"
Then, if we turn context search option on - the output word will be way more relevant to original:
"девочка слала шахту"
"соседка пугала веранду"
"сестрёнка меняла рекламу"
Althou, the whole output sentence may not have the same meaning as the original one. :) This is one of the major TO-DO's at the moment.
Installation
- Install NodeJS,
- Run
npm i -g text-morpher
to install package globally
This module has a built-in dictionary for about ~22K words, so it kinda works right after the installation. But the most interesting part of it require to use word2vec model file to enable context search features. In order to use it - we need to install a word2vector module and download the word2vec model for russian language. Since it is quite an expensive feature (the model is about 600Mb in size and when it is loaded it takes the same amount of RAM) I left it disabled by default. But you need just to run one extra command to make it all automatically. An important note: on Windows it will ask for admin privileges to compile a c++ binary, but don't worry - you may check /scripts/postInstall.js and make sure it doesn't do any harm. :)
- Run
text-morpher install
to add word2vec support
That's it - now you may use Morpher class in your app or run in from CLI
Configuration
There is a morpher.config.json file in the root folder. Morpher use it to look for matches when it processing tags. Here is all the options available at the moment:
{
// Parts of speech to process
// (see the whole list here http://opencorpora.org/dict.php?act=gram)
// Anythig else will be skipped and left in the original form
"parts": ["NOUN","ADJF","VERB","INFN"],
// Match the first letter of the original word
"first": false,
// Match the last letter of the original word
"last": false,
// Match the length of the original word
"length": false,
// Match all the vowels (молоко -> долото)
"vowels": false,
// Match syllable count (горбившиеся -> гусеницами)
"syllables": false,
// Match the accent syllable (пого`да -> профе`ссия)
"accent": false,
// Match the accent vowel (пого`да -> забо`та)
"accentLetter": false,
// Match all the original grammar tags (gender, tense, plural, case etc.)
// (see the whole list here http://opencorpora.org/dict.php?act=gram)
"tags": false,
// false or a string with regexp
"regexp": false,
// Enable search by word2vec model
"contextSearch": true,
// Treshold for search by word2vec model - float
"treshHold": 0.6
}
The same structure could be passed as an object to Morpher.runTemplate to override any of these options.
CLI Usage
At the moment two commands are supported from command line:
-
text-morpher cli - this command starts morpher in interactive mode.
- If you type any text - it will substitute it with any matches with default matching options from ./morpher.config.json
- If you enter just a single word - it will try find the best matching word in word2vec model. (яблоко -> печёное, сумка -> дерматиновая, красный -> буржуй еtc.)
- If you enter just a single word with tilda, like this:
~свобода
, it will show you all the closest matches from word2vec model.
-
text-morpher morph - this command has two options:
input
andoutput
, where you can define input textfile to process and the name of an output file. Morpher will process all text from input file and save it in the output
API Usage
This part is a subject to be updated soon. Like, very soon. :) But if you're interested - there are API call examples in /test folder
Useful links
OpenCorpora grammar tag list
Word2Vec models
Word2Vec technology reference
Google NGrams
Thanks to
Denis Olshin me@denull.ru for Az.js
Mikhail Korobov kmike84@gmail.com for Pymorphy
Lee Xun speachlesslee@gmail.com for word2vector