This is a script to generate a MySQL database containing 142,679 Russian words and their inflected morphologies. It is adapted from original work by М. Хаген and derivative work Yury Korol. Mostly, I've collected the database and the source text file in one location and made some changes in the database structure to make it more reflective of English terminology (for example "soul" → "animacy", etc.)
(N.B. A word about the origin of this data. It seems to be derived from a work by М. Хаген "Полная парадигма. Морфология." This text is circulating on the web in various places and I'm unaware of whether this is licensed or the intellectual property of it is protected in some way. Any infringement is unintentional. This source text re-encoded in UTF-8 is included in the repository.) *
The parsing of the original text file was done by Yury Korol as described in this blog post - "Морфологический словарь русского языка в виде SQL скрипта". The format is logical for the most part and is noted below. I may work with this a little to clear up issues related to biaspectual verbs.
The MySQL database consists of a single table words
containing the root words and inflected forms.
Field | Original field name* | Type | Explanation |
---|---|---|---|
IID | IID | int | key (суррогатный ключ) |
word | word | varchar[100] | word form (root or inflected) |
code | code | int | integer code of this word form |
code_parent | code_parent | int | an integer that refers to the base word form |
type | type | foreign key | row id in ru_words_types table |
type_sub | type_sub | enum | [1] |
type_ssub | type_ssub | enum | [2] |
plural | plural | boolean | 1 if plural or 0 otherwise |
gender | gender | foreign key | |
wcase | wcase | foreign key | |
comp | comp | enum | comparative, superlative degree; one of 'сравн','прев' |
animacy | soul | boolean | 1 if animate, 0 if inanimate |
v_transit | transit | enum | transitive vs intransitive; one of 'перех','непер','пер/не'; переходность глагола. |
v_perfect | perfect | boolean | 1 if perfective, 0 if imperfective |
v_person | face | enum | person (лицо глагола), 1,2, or 3 or 0 = ‘безл’ |
v_biaspect | kind | enum | 1 if biaspectual, 0 if not; None if not a verb |
v_tense | time | enum | one of ‘прош’, ‘наст’, ‘буд’ |
v_inf | inf | boolean | 1 if an infinitive, 0 otherwise |
v_reflect | vozv | boolean | 1 if reflexive, 0 otherwise |
v_voice | nakl | enum | one of 'пов','страд'; наклонение или залог глагола. 'пов' = 'повелительное наклонение |
short | short | bool | 1 if adjective short form, 0 or NULL otherwise |
v_activepart | - | varchar[5] | type of participle |
[ru_words] |
* A comment from the database creator, notes the following about licensing:
База распространялась бесплатно автором в виде текстовых файлов. Я лишь потрудился перевести её в sql скрипт. Потому распространяется это скрипт «бездваздмездно», с учетом того, что вы используете это как есть, принимая на себя ответственность за все возможные риски.
† This pertains, of course, only to verbs. As a practical matter in the table as previously parsed, the value of this field is always either NULL
or "2вид". The meaning of the latter appears to be "biaspectual" and in that case the perfect
field is NULL
.
This table is ru_words_types
.
Field | Type | Explanation |
---|---|---|
id | TINYINT | row index |
pos | VARCHAR[6] | Russian desc as in original |
pos_full | VARCHAR[15] | Full Russian desc |
pos_en | VARCHAR[20] | English desc |
upos | VARCHAR[5] | UD UPOS type |
This table is ru_words_cases
.
Field | Type | Explanation |
---|---|---|
id | TINYINT | row index |
wcase | VARCHAR[4] | Russian desc as in original |
case_full | VARCHAR[15] | Full Russian desc |
case_en | VARCHAR[15] | English desc |
case_ud | VARCHAR[5] | Case universal dependency |
This table is ru_words_genders
.
Field | Type | Explanation |
---|---|---|
id | TINYINT | row index |
ru | VARCHAR | Russian desc as in original |
ru_full | VARCHAR | Full Russian desc |
en | VARCHAR | English desc |
Another derivative project from Hagen's text file is a sqlite3 database with a Java access wrapper Github: sdreger/WordsFinder. This developer has made some attempt to simplify the structure a bit and make the data more relational.
part of speech - one of 'част','межд','прл','прч','сущ','нар','гл','дееп', 'союз','предик','предл','ввод','мест','числ'
TBD
Some examples of useful database queries:
- Find the genitive (singular and plural) of the word собака:
SELECT w.code, w.code_parent, w.word, w.wcase
FROM ru_words w
INNER JOIN ru_roots AS r on w.code_parent = r.code
INNER JOIN ru_words_cases c ON c.id = w.wcase
WHERE r.word = 'собака' AND c.case_en = 'genitive';
- Is собака animate or inanimate?
SELECT DISTINCT
CASE
WHEN w.animacy = 1 THEN 'animate'
WHEN w.animacy = 0 THEN 'inanimate'
ELSE 'N/A'
END AS AnimacyText
FROM ru_words w
INNER JOIN ru_roots AS r on w.code_parent = r.code
WHERE r.word = 'собака'
- All of the participle forms of делать
SELECT * FROM ru_words
INNER JOIN ru_roots ON ru_roots.code = ru_words.code_parent
WHERE ru_roots.word = 'делать'
AND ru_words.type IN (SELECT id FROM ru_words_types WHERE pos_en = 'participle')
[1]: Subtypes: one of 'поряд', 'кол', 'собир', 'неопр', 'врем', 'обст', 'опред', 'счет', 'неизм'; подтипы используются в основном для числительных и наречий.
[2]: Sub-subtype - one of 'кач','спос','степ','места','напр','врем','цель', 'причин'; под-подтипы задействованы только для наречий.
grammatical case, one of 'им','род','дат','вин','тв','пр','зват','парт','мест'
one of 'муж','жен','ср','общ'