Vowel symbols are removed from Devanagari (Hindi) scripts
taku910 opened this issue · 1 comment
taku910 commented
This is somehow related to this issue.
In Devanagari, a vowel sign combines with a consonant to form a compound letter. However, the current tokenizer removes some vowel signs, so only the consonant remains.
Example: कु (k + u) → क (k; ु is removed)
I haven't checked everything, but other languages written in scripts of the Brahmi family would likely have the same issue.
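The behavior can be reproduced outside the tokenizer with a minimal sketch. It assumes the accent-removal step decomposes text with NFD and then drops every code point in the Unicode category Mn (nonspacing mark); the function name `strip_nonspacing_marks` is hypothetical and is not taken from the tokenizer's code.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Drop all code points in category Mn after NFD decomposition.

    Assumption: this mirrors the tokenizer's accent-removal step.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


ku = "\u0915\u0941"  # कु = KA + VOWEL SIGN U (category Mn)
kaa = "\u0915\u093e"  # का = KA + VOWEL SIGN AA (category Mc)

for word in (ku, kaa):
    stripped = strip_nonspacing_marks(word)
    # Show the code points before and after stripping.
    print([f"U+{ord(c):04X}" for c in word], "->", [f"U+{ord(c):04X}" for c in stripped])
```

Under this assumption, ु (U+0941, category Mn) is dropped while ा (U+093E, category Mc, spacing mark) is kept, which would explain why only some vowel signs disappear.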
jacobdevlin-google commented
Yes, as I mentioned in multilingual.md, the normalization may cause ambiguities in certain languages.
I will try to train another multilingual model that does not do the lower casing + NFC normalization + accent removal, but I can't promise a date when it will be available.