single character unicode has the language name prefix

Question

dsplog opened this issue a year ago · 1 comment

Describe the bug
When the phonemizer is run on a single Unicode character, the output is prefixed with the phonemized name of the character's language.

Phonemizer version
home@home-desktop:$ phonemize --version
phonemizer-3.2.1
available backends: espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.2.1

System
home@home-desktop:$ uname -a
Linux home-desktop 5.15.0-88-generic #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Python 3.8.18 (default, Sep 11 2023, 13:40:15)
[GCC 11.2.0] :: Anaconda, Inc. on linux

To reproduce

>>> import phonemizer
>>> phon = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')
>>> 
>>> text = 'ന'
>>> phon.phonemize([text], strip=True)
['mæleɪˈɑːləmnˈɐ']
>>> 
>>> text = '\u0d28'
>>> phon.phonemize([text], strip=True)
['mæleɪˈɑːləmnˈɐ']

Expected behavior
The prefix 'mæleɪˈɑːləm' is not expected. Is there a way to suppress it?
By the way, if I initialize the backend with language 'ml', the prefix is not there:

>>> mlphon = phonemizer.backend.EspeakBackend(language='ml', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')
>>> mlphon.phonemize([text], strip=True)
['nˈɐ']

Additional context
It looks like the language_switch option does not handle single characters.

Answer 1 · 2023-11-24T11:06:16.000Z

Hi, thanks for reporting. Unfortunately, this comes from the espeak implementation, not from phonemizer itself:

$ phonemize --version
phonemizer-3.2.1
available backends: espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.2.1
$ echo 'ന' | espeak-ng -x -q --ipa -v en-us
mæleɪˈɑːləm(ml)nˈɐ(en-us)
$ echo 'ന' | espeak-ng -x -q --ipa -v ml
nˈɐ
$ echo 'ആനേ' | espeak-ng -x -q --ipa -v en-us
(ml)ˈaːneː(en-us)

I think this is a very special case: with a whole word, the problem does not occur. I suggest writing custom post-processing code, or experimenting with the regex that detects language switches in phonemizer.
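
For what it's worth, here is a minimal sketch of what such post-processing could look like. It is not part of phonemizer: `strip_language_switch` is a hypothetical helper, and it assumes the flags are still present in the string (as in the raw espeak-ng output above, or presumably with `language_switch='keep-flags'`), that a flag looks like `(ml)` or `(en-us)`, and that the spelled-out language name is glued directly in front of the flag with no whitespace.

```python
import re

# Assumed shape of an espeak language flag, e.g. "(ml)" or "(en-us)".
_LANG_CODE = r"[a-z]{2,3}(?:-[a-z0-9]+)*"


def strip_language_switch(text, base_language="en-us"):
    """Hypothetical helper: drop switch flags and any text glued in front of them."""
    base = re.escape(base_language)
    # A flag that switches away from the base language, together with the
    # run of non-space characters directly before it (the spelled-out name).
    switch_away = re.compile(r"[^\s()]*\((?!" + base + r"\))" + _LANG_CODE + r"\)")
    text = switch_away.sub("", text)
    # Remove the remaining flags (typically the switch back to the base language).
    return re.sub(r"\(" + _LANG_CODE + r"\)", "", text)


print(strip_language_switch("mæleɪˈɑːləm(ml)nˈɐ(en-us)"))  # nˈɐ
print(strip_language_switch("(ml)ˈaːneː(en-us)"))  # ˈaːneː
```

Note that this drops anything attached to a switch flag without a space, so it could also remove legitimate phonemes if espeak ever emits them that way; it is worth checking against your own data before relying on it.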