google-research/bert

Vowel symbols are removed from Devanagari (Hindi) scripts

taku910 opened this issue · 1 comments

This is somewhat related to this issue.

In Devanagari, a vowel combines with a consonant to form a compound letter. However, the current tokenizer removes some vowel symbols and only the consonant symbol remains.

Example: कु (k + u) → क (only k remains; the vowel sign ु is removed)

I haven't checked everything, but other languages written in scripts of the Brahmic family would likely have the same issue.
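The behavior can be reproduced outside the tokenizer with a few lines of Python. This is only a minimal sketch, assuming the accent-removal step works by decomposing the text with NFD and dropping every character in the Unicode category `Mn` (nonspacing mark); the function name `strip_nonspacing_marks` is made up for illustration and is not taken from the repository.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Hypothetical stand-in for an accent-removal step.

    Assumption: the text is NFD-decomposed and every code point whose
    Unicode category is "Mn" (nonspacing mark) is dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The vowel sign U (U+0941) is a nonspacing mark, so it is dropped.
ku = "\u0915\u0941"  # कु
print(unicodedata.category("\u0941"))          # Mn
print(strip_nonspacing_marks(ku) == "\u0915")  # True: only क is left

# The vowel sign AA (U+093E) is a spacing mark (Mc), so it survives.
kaa = "\u0915\u093e"  # का
print(unicodedata.category("\u093e"))          # Mc
print(strip_nonspacing_marks(kaa) == kaa)      # True: unchanged
```

Under that assumption, this would also explain why only some vowel symbols disappear: signs such as ु are classified as `Mn`, while signs such as ा are classified as `Mc` and pass through.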

Yes, as I mentioned in multilingual.md, the normalization may cause ambiguities in certain languages.

I will try to train another multilingual model that does not do the lower casing + NFC normalization + accent removal, but I can't promise a date for when it will be available.