allowed_symbols as a whitelist instead of disallowed_symbols

Question

stefangrotz opened this issue 5 years ago · 3 comments

When I experimented with the rules for Esperanto, it soon became clear that a lot of problems could be avoided by simply excluding most of the extended Latin and Cyrillic alphabets, as well as some common Chinese characters and Arabic letters. I did this by counting the frequency of letters in all extracted sentences and then adding everything I did not want to see to disallowed_symbols. But this slowed the extraction process down a lot.

It would be much easier to have a whitelist of allowed symbols. This could include the alphabet of the language and a few punctuation marks such as periods, question marks, commas, and quotes, but nothing more. This would regularize the structure of the sentences a lot and could be helpful for many languages.
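To illustrate the idea, here is a minimal Python sketch of such a whitelist check. It is not the extractor's actual code: the name ALLOWED_SYMBOLS, the helper is_allowed, and the exact set of punctuation are assumptions made up for this example. The letters are the 28 letters of the Esperanto alphabet in both cases.

```python
# Hypothetical whitelist for Esperanto. The punctuation set is an assumption
# chosen for illustration, not a recommendation from the project.
ALLOWED_SYMBOLS = frozenset(
    "abcdefghijklmnoprstuvzĉĝĥĵŝŭ"
    "ABCDEFGHIJKLMNOPRSTUVZĈĜĤĴŜŬ"
    " .,?!'\"-"
)


def is_allowed(sentence: str) -> bool:
    """Return True if every character of the sentence is in the whitelist."""
    return all(ch in ALLOWED_SYMBOLS for ch in sentence)


if __name__ == "__main__":
    for candidate in ["Ĉu vi parolas Esperanton?", "Привет", "Queen"]:
        print(candidate, is_allowed(candidate))
```

With a whitelist like this, any character from another script is rejected without having to be listed explicitly, so the list stays short no matter how many foreign symbols show up in the source text.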

Answer 1 · 2019-09-02T22:11:15.000Z

Yeah, I ran into the same problem with Georgian. There were many characters from the Latin, Cyrillic, and Chinese scripts. Georgian has its own alphabet, so we can specify its Unicode range in a regex.

Answer 2 · 2019-09-03T18:49:25.000Z

Ah, interesting. In regex flavors with Unicode property support you can match a whole alphabet directly, for example with \p{InGeorgian}:
https://www.regular-expressions.info/unicode.html
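
As a rough sketch of the same idea in Python: the standard-library re module does not support \p{...} classes, so this example spells out the Georgian Unicode block (U+10A0 to U+10FF) as an explicit range. The pattern name GEORGIAN_ONLY, the helper is_georgian_sentence, and the punctuation allowed alongside the letters are assumptions for illustration only.

```python
import re

# Assumption: accept the Georgian block U+10A0..U+10FF plus a space and a few
# basic punctuation marks. Anything else makes the sentence fail the check.
GEORGIAN_ONLY = re.compile(r"[\u10A0-\u10FF .,?!\-]+")


def is_georgian_sentence(sentence: str) -> bool:
    """Return True if the whole sentence consists of whitelisted characters."""
    return GEORGIAN_ONLY.fullmatch(sentence) is not None


if __name__ == "__main__":
    for candidate in ["გამარჯობა!", "Hello", "გამარჯობა world"]:
        print(candidate, is_georgian_sentence(candidate))
```

Because the pattern uses + and fullmatch, an empty string is rejected and a single character outside the range is enough to reject the sentence.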

Answer 3 · 2019-12-27T15:02:32.000Z

See also common-voice/common-voice#2505 (review) for German.