Single Unicode character is phonemized with the language name as a prefix
dsplog opened this issue · 1 comments
Describe the bug
When the phonemizer is run on a single Unicode character, the spoken name of the character's language appears as a prefix in the output.
Phonemizer version
home@home-desktop:$ phonemize --version
phonemizer-3.2.1
available backends: espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.2.1
System
home@home-desktop:$ uname -a
Linux home-desktop 5.15.0-88-generic #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Python 3.8.18 (default, Sep 11 2023, 13:40:15)
[GCC 11.2.0] :: Anaconda, Inc. on linux
To reproduce
>>> import phonemizer
>>> phon = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')
>>>
>>> text = 'ന'
>>> phon.phonemize([text], strip=True)
['mæleɪˈɑːləmnˈɐ']
>>>
>>> text = '\u0d28'
>>> phon.phonemize([text], strip=True)
['mæleɪˈɑːləmnˈɐ']
Expected behavior
The prefix 'mæleɪˈɑːləm' is not expected. Is there a way to suppress it?
By the way, if I initialize the backend with language 'ml', the prefix is not there:
>>> mlphon = phonemizer.backend.EspeakBackend(language='ml', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')
>>> mlphon.phonemize([text], strip=True)
['nˈɐ']
Additional context
It looks like language_switch does not handle single characters.
Hi, thanks for reporting. Unfortunately, this comes from the espeak implementation, not from phonemizer itself:
$ phonemize --version
phonemizer-3.2.1
available backends: espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.2.1
$ echo 'ന' | espeak-ng -x -q --ipa -v en-us
mæleɪˈɑːləm(ml)nˈɐ(en-us)
$ echo 'ന' | espeak-ng -x -q --ipa -v ml
nˈɐ
$ echo 'ആനേ' | espeak-ng -x -q --ipa -v en-us
(ml)ˈaːneː(en-us)
I think this is a very special case: with a whole word, the problem does not occur. I suggest writing custom post-processing code, or experimenting with the regex that detects language switches here.
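As a rough illustration of the post-processing idea, here is a minimal sketch that works on strings shaped like the raw espeak-ng output above, where the language flags are still present. The helper name `strip_language_prefix` and its `base_language` parameter are hypothetical and not part of phonemizer. It assumes that any phonemes glued directly in front of a foreign-language flag at the start of a token are the spoken language name, which may not hold for every input.

```python
import re


def strip_language_prefix(phonemes: str, base_language: str = "en-us") -> str:
    """Drop espeak language flags and a language name glued before a foreign flag.

    Hypothetical helper for illustration; not part of phonemizer.
    """
    base = re.escape(base_language)
    # A run of non-space characters at the start of a token, directly followed
    # by a flag for a language other than the base one, e.g. "mæleɪˈɑːləm(ml)".
    spoken_name = re.compile(rf"(?<!\S)[^\s()]*\((?!{base}\))[a-z0-9-]+\)")
    # Any flag that is still left afterwards, e.g. "(en-us)".
    any_flag = re.compile(r"\([a-z0-9-]+\)")
    without_name = spoken_name.sub("", phonemes)
    return any_flag.sub("", without_name)
```

To use this with phonemizer, one option would be to create the backend with language_switch='keep-flags' instead of 'remove-flags', so the flags survive, and then apply the helper to each returned string. This assumes the flags are kept in the same form as in the espeak-ng output shown above.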