UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.
A Python wrapper for the Russian finite-state transducer originally described in chapter 2 of my dissertation.
If you use this work in your research, please cite the following:

Reynolds, Robert J. "Russian natural language processing for computer-assisted language learning: capturing the benefits of deep morphological analysis in real-life applications." PhD diss., UiT–The Arctic University of Norway, 2016. https://hdl.handle.net/10037/9685
For all features to be available, you should have `hfst` and `vislcg3` installed as command-line utilities. Specifically, `hfst` is needed for FST-based tokenization, and `vislcg3` is needed for grammatical disambiguation. The version used to successfully test the code is included in each commit in this file. The recommended method for installing these dependencies is as follows:
On macOS:

```bash
$ curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo bash
```

On Debian/Ubuntu:

```bash
$ wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
$ sudo apt-get install cg3 hfst hfst-dev
```
This package can be installed from PyPI using the usual...

```bash
$ python -m pip install --user udar
```

...or directly from this repository using...

```bash
$ python3 -m pip install --user git+https://github.com/reynoldsnlp/udar
```
NB! Documentation is currently limited to docstrings. I recommend that you use `help()` frequently to see how to use classes and methods. For example, to see what options are available for building a `Document`, try `help(Document)`.
The most common use-case is to use the `Document` constructor to automatically tokenize and analyze a text. If you `print()` a `Document` object, the result is an XFST/HFST stream:
```python
import udar
doc1 = udar.Document('Мы удивились простоте системы.')
print(doc1)
# Мы	мы+Pron+Pers+Pl1+Nom	0.000000
#
# удивились	удивиться+V+Perf+IV+Pst+MFN+Pl	5.078125
#
# простоте	простота+N+Fem+Inan+Sg+Dat	4.210938
# простоте	простота+N+Fem+Inan+Sg+Loc	4.210938
#
# системы	система+N+Fem+Inan+Pl+Acc	5.429688
# системы	система+N+Fem+Inan+Pl+Nom	5.429688
# системы	система+N+Fem+Inan+Sg+Gen	5.429688
#
# .	.+CLB	0.000000
```
Passing the argument `disambiguate=True`, or running `doc1.disambiguate()` after the fact, will run a Constraint Grammar to remove as many ambiguous readings as possible. This grammar is far from complete, so some ambiguous readings will remain.
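For example, either of the following will disambiguate the document built above (a minimal sketch; the two approaches are equivalent in effect):

```python
doc1.disambiguate()  # run the Constraint Grammar on an existing Document
# ...or equivalently, disambiguate at construction time:
doc2 = udar.Document('Мы удивились простоте системы.', disambiguate=True)
```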
`Document` objects have the following properties:

Property | Type | Description |
---|---|---|
`text` | `str` | Original text of this document |
`sentences` | `List[Sentence]` | List of sentences in this document |
`num_tokens` | `int` | Number of tokens in this document |
`features` | `tuple` | `udar.features.FeatureExtractor` stores extracted features here |
`Document` objects have convenient methods for adding stress or converting to phonetic transcription.
Method | Return type | Description |
---|---|---|
`stressed` | `str` | The original text of the document with stress marks |
`phonetic` | `str` | The original text converted to phonetic transcription |
`transliterate` | `str` | The original text converted to Romanized Cyrillic (default=Scholarly) |
`disambiguate` | `None` | Disambiguate readings using the Constraint Grammar |
`cg3_str` | `str` | Analysis stream in the VISL-CG3 format |
`from_cg3` | `Document` | Create a `Document` from a VISL-CG3 format stream |
`hfst_str` | `str` | Analysis stream in the XFST/HFST format |
`from_hfst` | `Document` | Create a `Document` from an XFST/HFST format stream |
`to_dict` | `list` | Convert to a complex `list` object |
`to_json` | `str` | Convert to a JSON string |
```python
stressed_doc1 = doc1.stressed()
print(stressed_doc1)
# Мы́ удиви́лись простоте́ систе́мы.
```

```python
ambig_doc = udar.Document('Твои слова ничего не значат.', disambiguate=True)
print(sorted(ambig_doc[1].stresses()))  # Note that слова is still ambiguous
# ['сло́ва', 'слова́']

print(ambig_doc.stressed(selection='safe'))  # 'safe' skips сло́ва and слова́
# Твои́ слова ничего́ не зна́чат.

print(ambig_doc.stressed(selection='all'))  # 'all' combines сло́ва and слова́
# Твои́ сло́ва́ ничего́ не зна́чат.

print(ambig_doc.stressed(selection='rand') in {'Твои́ сло́ва ничего́ не зна́чат.', 'Твои́ слова́ ничего́ не зна́чат.'})  # 'rand' randomly chooses between сло́ва and слова́
# True
```

```python
phonetic_doc1 = doc1.phonetic()
print(phonetic_doc1)
# мы́ уд'ив'и́л'ис' пръстʌт'э́ с'ис'т'э́мы.
```
`Sentence` objects have the following properties:

Property | Type | Description |
---|---|---|
`doc` | `Document` | "Back pointer" to the parent document of this sentence |
`text` | `str` | Original text of this sentence |
`tokens` | `List[Token]` | The list of tokens in this sentence |
`id` | `str` | (optional) Sentence id, if assigned at creation |
`Sentence` objects have the following methods:

Method | Return type | Description |
---|---|---|
`stressed` | `str` | The original text of the sentence with stress marks |
`phonetic` | `str` | The original text converted to phonetic transcription |
`transliterate` | `str` | The original text converted to Romanized Cyrillic (default=Scholarly) |
`disambiguate` | `None` | Disambiguate readings using the Constraint Grammar |
`cg3_str` | `str` | Analysis stream in the VISL-CG3 format |
`from_cg3` | `Sentence` | Create a `Sentence` from a VISL-CG3 format stream |
`hfst_str` | `str` | Analysis stream in the XFST/HFST format |
`from_hfst` | `Sentence` | Create a `Sentence` from an XFST/HFST format stream |
`to_dict` | `list` | Convert to a complex `list` object |
`to_json` | `str` | Convert to a JSON string |
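For example, a minimal sketch reusing `doc1` from above:

```python
sent = doc1.sentences[0]  # the first (and only) Sentence in doc1
print(sent.text)
# Мы удивились простоте системы.
```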
`Token` objects have the following properties:

Property | Type | Description |
---|---|---|
`id` | `str` | The index of this token in the sentence, 1-based |
`text` | `str` | The original text of this token |
`misc` | `str` | Miscellaneous annotations regarding this token |
`lemmas` | `Set[str]` | All possible lemmas, based on remaining readings |
`readings` | `List[Reading]` | List of readings not removed by the Constraint Grammar |
`removed_readings` | `List[Reading]` | List of readings removed by the Constraint Grammar |
`deprel` | `str` | The dependency relation between this word and its syntactic head, e.g. 'nmod' |
`Token` objects have the following methods:

Method | Return type | Description |
---|---|---|
`stresses` | `Set[str]` | All possible stressed wordforms, based on remaining readings |
`stressed` | `str` | The original text of the token with stress marks |
`phonetic` | `str` | The original text converted to phonetic transcription |
`most_likely_reading` | `Reading` | "Most likely" reading (may be partially random selection) |
`most_likely_lemmas` | `List[str]` | List of lemma(s) from the "most likely" reading |
`transliterate` | `str` | The original text converted to Romanized Cyrillic (default=Scholarly) |
`force_disambiguate` | `None` | Fully disambiguate readings using methods other than the Constraint Grammar |
`cg3_str` | `str` | Analysis stream in the VISL-CG3 format |
`hfst_str` | `str` | Analysis stream in the XFST/HFST format |
`to_dict` | `dict` | Convert to a `dict` object |
`to_json` | `str` | Convert to a JSON string |
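Again, a minimal sketch reusing `doc1`:

```python
tok = doc1.sentences[0].tokens[2]  # the Token for 'простоте'
print(tok.text)
# простоте
print(tok.lemmas)
# {'простота'}
```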
`Reading` objects have the following properties:

Property | Type | Description |
---|---|---|
`subreadings` | `List[Subreading]` | Usually only one subreading, but multiple subreadings are possible for complex `Token`s |
`lemmas` | `List[str]` | Lemmas from all subreadings |
`grouped_tags` | `List[Tag]` | The part-of-speech, morphosyntactic, semantic, and other tags from all subreadings |
`weight` | `str` | Weight indicating the likelihood of the reading, without respect to context |
`cg_rule` | `str` | Reference to the rule in the Constraint Grammar that removed/selected/etc. this reading. If no action has been taken on this reading, then `''` |
`is_most_likely` | `bool` | Whether this reading has been selected as the most likely reading of its `Token`. Note that some selection methods may be at least partially random |
`Reading` objects have the following methods:

Method | Return type | Description |
---|---|---|
`cg3_str` | `str` | Analysis stream in the VISL-CG3 format |
`hfst_str` | `str` | Analysis stream in the XFST/HFST format |
`generate` | `str` | Generate the wordform from this reading |
`replace_tag` | `None` | Replace a tag in this reading |
`does_not_conflict` | `bool` | Determine whether a reading from an external tagset (e.g. Universal Dependencies) conflicts with this reading |
`to_dict` | `list` | Convert to a `list` object |
`to_json` | `str` | Convert to a JSON string |
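A minimal sketch, continuing from `tok` above:

```python
reading = tok.readings[0]  # one of the readings of 'простоте'
print(reading.lemmas)
# ['простота']
```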
`Subreading` objects have the following properties:

Property | Type | Description |
---|---|---|
`lemma` | `str` | The lemma of the subreading |
`tags` | `List[Tag]` | The part-of-speech, morphosyntactic, semantic, and other tags |
`tagset` | `Set[Tag]` | Same as `tags`, but for faster membership testing (`in Reading`) |
`Subreading` objects have the following methods:

Method | Return type | Description |
---|---|---|
`cg3_str` | `str` | Analysis stream in the VISL-CG3 format |
`hfst_str` | `str` | Analysis stream in the XFST/HFST format |
`replace_tag` | `None` | Replace a tag in this subreading |
`to_dict` | `dict` | Convert to a `dict` object |
`to_json` | `str` | Convert to a JSON string |
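A minimal sketch, continuing from `reading` above:

```python
sub = reading.subreadings[0]  # most readings have exactly one subreading
print(sub.lemma)
# простота
```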
`Tag` objects have the following properties:

Property | Type | Description |
---|---|---|
`name` | `str` | The name of this tag |
`ms_feat` | `str` | Morphosyntactic feature that this tag is associated with (e.g. `Dat` has `ms_feat` `CASE`) |
`detail` | `str` | Description of the tag's purpose or meaning |
`is_L2_error` | `bool` | Whether this tag indicates a second-language learner error |
`Tag` objects have the following methods:

Method | Return type | Description |
---|---|---|
`info` | `str` | Alias for `Tag.detail` |
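A minimal sketch, continuing from `sub` above; here I assume the tags appear in the order of the analysis string:

```python
tag = sub.tags[0]  # the part-of-speech tag, N
print(tag.name, tag.is_L2_error)
# N False
```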
A number of functions are included, both for convenience and to give concrete examples of how to use the API.

The `noun_distractors()` function generates all six cases of a given noun. If the given noun is singular, then the function generates singular forms. If the given noun is plural, then the function generates plural forms. Such a list can be used in a multiple-choice exercise, hence the name `distractors`.
```python
sg_paradigm = udar.noun_distractors('словом')
print(sg_paradigm == {'сло́ву', 'сло́ве', 'сло́вом', 'сло́ва', 'сло́во'})
# True

pl_paradigm = udar.noun_distractors('словах')
print(pl_paradigm == {'слова́м', 'слова́', 'слова́х', 'слова́ми', 'сло́в'})
# True
```
If unstressed forms are desired, simply pass the argument `stressed=False`.
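For example (a minimal sketch):

```python
unstressed_paradigm = udar.noun_distractors('словом', stressed=False)
print(unstressed_paradigm == {'слову', 'слове', 'словом', 'слова', 'слово'})
# True
```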
The `diagnose_L2()` function takes a text string as its argument and returns a dictionary of all the types of L2 errors found in the text, along with examples of each error.
```python
diag = udar.diagnose_L2('Етот малчик говорит по-русски.')
print(diag == {'Err/L2_e2je': {'Етот'}, 'Err/L2_NoSS': {'малчик'}})
# True
```
The `tag_info()` function looks up the meaning of any tag used by the analyzer.
```python
print(udar.tag_info('Err/L2_ii'))
# L2 error: Failure to change ending ие to ии in +Sg+Loc or +Sg+Dat, e.g. к Марие, о кафетерие, о знание
```
The transducers come in two varieties: the `Analyzer` class and the `Generator` class. For memory efficiency, I recommend using the `get_analyzer` and `get_generator` functions, which ensure that each flavor of the transducers remains a singleton in memory.
The `Analyzer` can be initialized with or without analyses for second-language learner errors using the keyword `L2_errors`.
```python
analyzer = udar.get_analyzer()  # by default, L2_errors is False
L2_analyzer = udar.get_analyzer(L2_errors=True)
```
`Analyzer`s are callable. They take a token `str` and return a sequence of reading/weight `tuple`s.
```python
raw_readings1 = analyzer('сло́ва')
print(raw_readings1)
# (('слово+N+Neu+Inan+Sg+Gen', 5.9755859375),)

raw_readings2 = analyzer('слова')
print(raw_readings2)
# (('слово+N+Neu+Inan+Pl+Acc', 5.9755859375), ('слово+N+Neu+Inan+Pl+Nom', 5.9755859375), ('слово+N+Neu+Inan+Sg+Gen', 5.9755859375))
```
The `Generator` can be initialized in three varieties: unstressed, stressed, and phonetic.
```python
generator = udar.get_generator()  # unstressed by default
stressed_generator = udar.get_generator(stressed=True)
phonetic_generator = udar.get_generator(phonetic=True)
```
`Generator`s are callable. They take a `Reading` or raw reading `str` and return a surface form.
```python
print(stressed_generator('слово+N+Neu+Inan+Pl+Nom'))
# слова́
```
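A `Reading` object can likewise be passed to a generator. A minimal sketch, reusing `doc1` from above:

```python
reading = doc1.sentences[0].tokens[1].readings[0]  # удивиться+V+Perf+IV+Pst+MFN+Pl
print(stressed_generator(reading))
# удиви́лись
```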
You can easily check whether a morphosyntactic tag is in a `Token`, `Reading`, or `Subreading` using `in`:
```python
token2 = udar.Token('слова', analyze=True)
print(token2)
# слова [слово_N_Neu_Inan_Pl_Acc слово_N_Neu_Inan_Pl_Nom слово_N_Neu_Inan_Sg_Gen]

print('Gen' in token2)  # do any of the readings include Genitive case?
# True

print('слово' in token2)  # does not work for lemmas; use `in Token.lemmas`
# False

print('слово' in token2.lemmas)
# True
```
You can make a filtered list of a `Token`'s readings using the following idiom:
```python
pl_readings = [reading for reading in token2 if 'Pl' in reading]
print(pl_readings)
# [Reading(слово+N+Neu+Inan+Pl+Acc, 5.975586, ), Reading(слово+N+Neu+Inan+Pl+Nom, 5.975586, )]
```
Related repositories:

- https://github.com/giellalt/lang-rus (the FSTs underlying this package come from here)
- https://github.com/mikahama/uralicNLP