chrislit/abydos

Empty output for russian in BeiderMorse encoding

Closed this issue · 2 comments

from abydos.phonetic import BeiderMorse
BeiderMorse('russian').encode('привет')

Outputs nothing (an empty string).

But if you pass 'cyrillic' instead of 'russian', it seems to work well:

BeiderMorse('cyrillic').encode('привет')

Outputs:

privit,prQvit

This is the expected behavior (based on the reference implementation). It's good to keep in mind that Beider-Morse is mostly intended for names and genealogy, especially Jewish names. To the Beider-Morse algorithm, 'russian' isn't necessarily intended for Russian words, but for surnames of Russian emigrants who moved to nations that use the Latin alphabet, so 'russian' expects input transliterated to Latin. The 'cyrillic' mode, by contrast, is specifically intended for Russian surnames written in Cyrillic. (You can read about this at https://stevemorse.org/phonetics/bmpm.htm).

With that in mind, you can get the expected output by transliterating:

In [1]: from abydos.phonetic import BeiderMorse
In [2]: BeiderMorse('russian').encode('privet')
Out[2]: 'privit,prQvit'

Or, you can use the 'cyrillic' setting, knowing that it is Russian-specific.
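
If the input may arrive in either script, one option is to pick the setting from the characters themselves. The following is only a rough sketch: `pick_bm_language` is a hypothetical helper, not part of abydos, and it assumes that any Cyrillic letter in the input means the 'cyrillic' setting is wanted, with 'russian' as the fallback for Latin transliterations.

```python
import unicodedata


def pick_bm_language(name: str) -> str:
    """Hypothetical helper (not part of abydos): choose a BeiderMorse setting.

    Returns 'cyrillic' if any letter in `name` is a Cyrillic character,
    otherwise 'russian', which expects Latin-transliterated input.
    """
    for ch in name:
        # unicodedata.name() returns e.g. 'CYRILLIC SMALL LETTER PE';
        # the '' default covers characters that have no Unicode name.
        if ch.isalpha() and unicodedata.name(ch, '').startswith('CYRILLIC'):
            return 'cyrillic'
    return 'russian'
```

With abydos installed, this could then be used as `BeiderMorse(pick_bm_language(word)).encode(word)`, so that both 'привет' and 'privet' are routed to a setting that accepts them.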

I added some explanations within the BeiderMorse() constructor that might make the script/language distinction clearer for the 6 cases that involve non-Latin scripts or Latin transliteration. (bc257b7)