Kozea/Pyphen

Hungarian hyphenation is faulty in case of vowel-consonant-vowel-* words

aswna opened this issue · 3 comments

aswna commented

Hello,
using the latest Pyphen (0.14.0), there seems to be an issue with the hyphenation of Hungarian words that start with a vowel-consonant-vowel sequence. For example, "alak" should be hyphenated as "a-lak" (Pyphen currently does not hyphenate it at all), and "alaktalan" as "a-lak-ta-lan" (Pyphen incorrectly gives "alak-ta-lan").

I saw that you suggested checking with https://www.ushuaia.pl/hyphen/?ln=en (with Hungarian selected as the language). Words of this type are hyphenated incorrectly there as well. I also checked them in LibreOffice (7.3.7.2), which has the same issue.

Notes:

  • I have not checked previous releases; I have only just started trying Pyphen
  • I could not find any other type of word that is hyphenated incorrectly, e.g. "láda", "labda" and "drága" are all hyphenated correctly ("lá-da", "lab-da", "drá-ga") by these tools.

A better reference for cross-checking is https://helyesiras.mta.hu/helyesiras/default/hyph# .
Note: MTA (mta.hu) is the Hungarian Academy of Sciences.
All the hyphenations I checked there were correct, including the words above.

Thanks for looking into this!

liZe commented

Hi!

Thanks for this report.

> I saw that you suggested checking with https://www.ushuaia.pl/hyphen/?ln=en (with Hungarian selected as the language). Words of this type are hyphenated incorrectly there as well. I also checked them in LibreOffice (7.3.7.2), which has the same issue.

Then it means that the problem (or maybe it’s a known limitation) comes from the dictionary. The best way to solve this is to talk with the authors of the dictionary, you’ll find more information about them in this file.

> A better reference for cross-checking is https://helyesiras.mta.hu/helyesiras/default/hyph# .

I suggest using ushuaia.pl because it uses the same dictionary as Pyphen but not the same code. So, if users see the same problem with both Pyphen and ushuaia.pl, the problem is in the dictionary (which we don’t maintain and can’t fix), not in the code (which we maintain and can fix).

aswna commented

For the record: this behavior follows from this note on hyphenation in the Hungarian orthography rule book:

"Az egyetlen magánhangzóból álló szókezdő és szó végi szótagot – bár önállóságát nyelvi tekintetben nem lehet elvitatni – esztétikai okokból nem szokás egymagában a sor végén hagyni, illetőleg a következő sorba átvinni" (https://helyesiras.mta.hu/helyesiras/default/akh12#F8 , section 226).

In English: a word-initial or word-final syllable consisting of a single vowel is, for aesthetic reasons, not customarily left alone at the end of a line or carried over alone to the next line, even though its status as a separate syllable is linguistically indisputable. In other words, a break such as "a-lak" is correct, but it is not considered "nice" in typeset text.

I contacted the authors, who confirmed that the "hyph" dictionary on its own is not sufficient for finding every possible hyphenation point.

László Németh suggested the following workaround, which works for simple cases:

$ /home/laci/libreoffice/workdir/UnpackedTarball/hyphen/example /home/laci/libreoffice/dictionaries/hu_HU/hyph_hu_HU.dic /dev/stdin | sed -E 's/^([aáeéiíoóöőuúüű])(([^aáeéiíoóöőuúüű]|cs|gy|ny|sz|ty|zs)?[aáeéiíoóöőuúüű])/\1=\2/;s/([aáeéiíoóöőuúüű])([aáeéiíoóöőuúüű])$/\1=\2/'
agyabugyál
Fehérlófia
a=gya=bu=gyál fe=hér=ló=fi=a
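
For anyone who would rather post-process Pyphen's output directly, here is a minimal Python sketch of the same two substitutions. The function name add_single_vowel_breaks is made up for this example, and the case-insensitive matching is an addition that the sed expression does not have. Like the sed version, it only knows the digraphs listed there and does not handle compound words.

```python
import re

# Vowels and consonant digraphs copied from the sed expression above.
VOWELS = "aáeéiíoóöőuúüű"
DIGRAPHS = "cs|gy|ny|sz|ty|zs"

# Rule 1: a word-initial single vowel, followed by at most one consonant
# (or one of the listed digraphs) and another vowel, gets a break after it.
LEADING = re.compile(
    rf"^([{VOWELS}])((?:[^{VOWELS}]|{DIGRAPHS})?[{VOWELS}])",
    re.IGNORECASE,
)
# Rule 2: two adjacent vowels at the very end of the word get a break
# between them, so the final vowel becomes its own syllable.
TRAILING = re.compile(rf"([{VOWELS}])([{VOWELS}])$", re.IGNORECASE)


def add_single_vowel_breaks(hyphenated: str, sep: str = "=") -> str:
    """Add the single-vowel breaks that the hyph dictionary leaves out.

    `hyphenated` is one word as already split by the dictionary,
    e.g. "agya=bu=gyál"; `sep` is the separator used in that string.
    """
    def insert_break(match: re.Match) -> str:
        return match.group(1) + sep + match.group(2)

    result = LEADING.sub(insert_break, hyphenated, count=1)
    return TRAILING.sub(insert_break, result, count=1)


if __name__ == "__main__":
    for word in ("agya=bu=gyál", "fe=hér=ló=fia"):
        print(add_single_vowel_breaks(word))
```

With Pyphen, the input string could come from something like pyphen.Pyphen(lang="hu_HU").inserted(word, hyphen="="), passing the same separator as sep.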

He also noted that the above does not handle compound words, but that it can be extended by using Hunspell's morphological analysis (see the "st:" and "pa:" fields):

$ hunspell -m
elagyabugyál
elagyabugyál ip:PREF sp:el st:agyabugyál po:vrb ts:PRES_INDIC_INDEF_SG_3

szappanopera
szappanopera pa:szappan st:szappan po:noun ts:NOM pa:opera st:opera po:noun ts:NOM

Note: the correct hyphenations for the above are "el-a-gya-bu-gyál", "szap-pan-o-pe-ra" (and not "e-la-gya-bu-gyál" and "szap-pa-no-pe-ra").
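
A rough Python sketch of how that extension might look, assuming the analysis lines have exactly the shape shown above. The helper names split_by_analysis and hyphenate_compound are made up for this example, and the reading of "sp:", "pa:" and "st:" as carrying the component strings is inferred only from the two sample lines. The per-component hyphenator is passed in as a callable (for instance the dictionary plus the single-vowel fix); a small lookup table stands in for it here. If the extracted parts do not concatenate back to the word (e.g. for inflected forms), the word is left unsplit.

```python
from typing import Callable, List

# Fields assumed (from the two sample lines only) to carry component strings.
BOUNDARY_TAGS = ("sp:", "pa:", "st:")


def split_by_analysis(word: str, analysis: str) -> List[str]:
    """Split `word` into components using one `hunspell -m` analysis line."""
    parts: List[str] = []
    for field in analysis.split():
        if field.startswith(BOUNDARY_TAGS):
            value = field[3:]
            # "pa:szappan st:szappan" repeats the same part; keep it once.
            if not parts or parts[-1] != value:
                parts.append(value)
    # Only trust the split if the parts rebuild the word exactly.
    return parts if "".join(parts) == word else [word]


def hyphenate_compound(
    word: str,
    analysis: str,
    hyphenate_part: Callable[[str], str],
    sep: str = "-",
) -> str:
    """Hyphenate each component separately, then join at the boundaries."""
    parts = split_by_analysis(word, analysis)
    return sep.join(hyphenate_part(part) for part in parts)


if __name__ == "__main__":
    # Stand-in for a real per-component hyphenator (hypothetical data).
    table = {
        "el": "el",
        "agyabugyál": "a-gya-bu-gyál",
        "szappan": "szap-pan",
        "opera": "o-pe-ra",
    }
    line = "elagyabugyál ip:PREF sp:el st:agyabugyál po:vrb ts:PRES_INDIC_INDEF_SG_3"
    print(hyphenate_compound("elagyabugyál", line, lambda p: table.get(p, p)))
```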

aswna commented

Thanks!