асфальтны is not analyzed correctly
mansayk opened this issue · 16 comments
According to the Tatar orthographic dictionary, it should be "асфальтны", not "асфальтне":
http://suzlek.antat.ru/words.php?txtW=%D0%B0%D1%81%D1%84%D0%B0%D0%BB%D1%8C%D1%82&submit=%D0%AD%D0%B7%D0%BB%D3%99%D2%AF
echo "асфальтны" | apertium-destxt -n | lt-proc -z -w 'apertium-tat/tat.automorf.bin' | cg-proc -z 'apertium-tat/tat.rlx.bin' | cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' | apertium-retxt
^асфальтны/*асфальтны$
echo "асфальтне" | apertium-destxt -n | lt-proc -z -w 'apertium-tat/tat.automorf.bin' | cg-proc -z 'apertium-tat/tat.rlx.bin' | cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' | apertium-retxt
^асфальтне/асфальт<n><sg><acc>$
The same thing here:
echo "ательены" | apertium-destxt -n | lt-proc -z -w 'apertium-tat/tat.automorf.bin' | cg-proc -z 'apertium-tat/tat.rlx.bin' | cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' | apertium-retxt
^ательены/*ательены$
echo "ательене" | apertium-destxt -n | lt-proc -z -w 'apertium-tat/tat.automorf.bin' | cg-proc -z 'apertium-tat/tat.rlx.bin' | cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' | apertium-retxt
^ательене/ателье<n><sg><acc>$
According to the Tatar orthographic dictionary, it should be "асфальтны", not "асфальтне":
So we should definitely generate асфальтны, but should we analyse both forms? That is, is асфальтне attested commonly enough?
(Btw, the dictionary link doesn't show any relevant information when I click on it.)
Also, can you confirm how nouns that end in ль behave, like роль, руль, автомобиль? What about words that end in бль, like рубль, ансамбль, etc.?
but should we analyse both forms? That is, is асфальтне attested commonly enough?
Some people can of course write "асфальтне", but it will be a spelling mistake. If we analyze both forms, then it will also affect apertium's spellchecker.
That said, the spellchecker already doesn't work as expected because of the many archaic and dialectal words in the dictionary; that's why I think we should add some 'Orth' tag for "good" words in the dictionary and spellchecker would use only them...
Maybe here we should analyze both forms but add some additional tag that means that it is not orthographically correct. If I remember correctly @IlnarSelimcan already used one a couple of times...
Also, can you confirm how nouns that end in ль behave, like роль, руль, автомобиль? What about words that end in бль, like рубль, ансамбль, etc.?
Most of them have affixes with front vowels, but there might be exceptions. For example, correct ones:
рольдән
рульдән
автомобильдән
ансамбльдән
but
акропольдан (I don't know why, but see http://suzlek.antat.ru/words.php?txtW=%D0%B0%D0%BA%D1%80%D0%BE%D0%BF%D0%BE%D0%BB%D1%8C&submit=%D0%AD%D0%B7%D0%BB%D3%99%D2%AF)
And some more:
фасоль, фасолена
декольте, декольтесы
кольт, кольты
вольт, вольты
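To make the ль pattern above concrete, here is a minimal sketch in Python. It is not how the apertium-tat lexicon encodes this; the function name `ablative` and the exception set are hypothetical, and the only data used are the forms listed in this comment. It assumes that ль-final loans default to the front-vowel ablative suffix and that back-vowel cases are listed one by one.

```python
# Hypothetical illustration only: the real lexicon handles this with
# word classes, not a Python lookup.

# Stems ending in "ль" that take the back-vowel ablative suffix,
# according to the forms quoted in this thread.
BACK_ABLATIVE_EXCEPTIONS = {"акрополь"}


def ablative(stem: str) -> str:
    """Return the ablative form of a noun stem ending in "ль".

    Assumption: front-vowel "дән" is the default and back-vowel "дан"
    applies only to explicitly listed exceptions.
    """
    if not stem.endswith("ль"):
        raise ValueError("this sketch only covers stems ending in 'ль'")
    suffix = "дан" if stem in BACK_ABLATIVE_EXCEPTIONS else "дән"
    return stem + suffix


if __name__ == "__main__":
    for stem in ["роль", "руль", "автомобиль", "ансамбль", "акрополь"]:
        print(stem, ablative(stem))
```

An explicit exception list keeps the default visible and makes it easy to add further words once they are checked against the dictionary.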
^ательены/*ательены$
Do Russian words ending in ‹е› generally take back vowel endings? That is, is this part of a larger pattern, or is it an exception?
Related issue: we have the lexicon set up to do both ноябрьдә and ноябрьда. Which is correct?
Also, is it январенда or январендә? Once I got фасоленда working, январендә is now being produced as январенда. I'll hack it to only work with оль words for now, but this will need to be investigated.
I think we should add some 'Orth' tag for "good" words in the dictionary and spellchecker would use only them...
Actually, we do the reverse. We add a tag <err_orth> for words that are attested but are considered orthographic errors, and we just automatically remove them when we generate the spell checker. So what we want (and as of eb360c7 now get) is the following:
$ echo "асфальтны" | apertium -d . tat-morph
^асфальтны/асфальт<n><acc>$^./.<sent>$
$ echo "асфальтне" | apertium -d . tat-morph
^асфальтне/асфальт<n><acc><err_orth>$^./.<sent>$
Have a look at the commit—with knowledge of how the word-class categorisation works, it's pretty simple to do for many words.
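To illustrate the effect of the tag, here is a rough Python sketch that decides whether a form would count as correctly spelled, given one lexical unit of analyser output. This is not how the spell checker is actually built (the tagged entries are removed when it is generated, not filtered afterwards); the function names are hypothetical, and the parsing assumes the simple `^surface/analysis/...$` format shown above with no escaped characters.

```python
ERR_TAG = "<err_orth>"  # tag name as used in this thread


def parse_unit(unit: str) -> tuple[str, list[str]]:
    """Split one lexical unit, e.g. '^form/lemma<n><acc>$', into
    its surface form and its list of analyses."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    surface, *analyses = body[1:-1].split("/")
    return surface, analyses


def is_accepted(unit: str) -> bool:
    """True if at least one analysis is known (no leading '*')
    and is not marked as an orthographic error."""
    _, analyses = parse_unit(unit)
    return any(
        not a.startswith("*") and ERR_TAG not in a for a in analyses
    )


if __name__ == "__main__":
    for unit in [
        "^асфальтны/асфальт<n><acc>$",
        "^асфальтне/асфальт<n><acc><err_orth>$",
    ]:
        print(unit, is_accepted(unit))
```

With the two outputs above, асфальтны is accepted and асфальтне is rejected, while both still receive an analysis.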
"Акрополь" is strange. You can search for that word here:
http://suzlek.antat.ru
And it finds it.
According to the aforementioned website the correct one is "ноябрьдә".
It also says that the correct one is "январенда".
"фасоль"
- correct "фасолена" according to orthographical dictionary.
- correct "фасольгә" according to explanatory dictionary.
So it turns out that both of them can be treated as correct?
Do Russian words ending in ‹е› generally take back vowel endings? That is, is this part of a larger pattern, or is it an exception?
I can't say for certain right now, but I think you are right. All the words that come to mind have endings with back vowels: ришельесы, ательесы, льесы, подпольесы.