common-voice/cv-sentence-extractor

allowed_symbols as a whitelist instead of disallowed_symbols

stefangrotz opened this issue · 3 comments

When I experimented with the rules for Esperanto, it soon became clear that many problems could be avoided by simply excluding most of the extended Latin and Cyrillic alphabets, along with some common Chinese characters and Arabic letters. I did this by counting the frequency of each character across all extracted sentences and then adding everything I did not want to see to disallowed_symbols. But this slowed the extraction process down a lot.
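As an illustration of the counting step, here is a minimal Python sketch. It is not the script actually used for Esperanto: the sample sentences, the `char_frequencies` helper, and the `esperanto_letters` set are assumptions made up for this example.

```python
from collections import Counter

def char_frequencies(sentences):
    """Count how often each character occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)  # a string is iterated character by character
    return counts

# Hypothetical sample; real input would be the extracted sentences.
sample = ["Saluton, mondo!", "Ĉu vi parolas Esperanton?", "Привет"]
counts = char_frequencies(sample)

# Assumed lowercase Esperanto alphabet, used only to spot unexpected letters.
esperanto_letters = set("abcĉdefgĝhĥijĵklmnoprsŝtuŭvz")

# Letters seen in the data that are not part of the assumed alphabet;
# these are the candidates one would add to disallowed_symbols.
unexpected = sorted(
    ch for ch in counts if ch.isalpha() and ch.lower() not in esperanto_letters
)
print(unexpected)
```

With real data, the list of unexpected letters can get very long, which is the motivation for the whitelist proposed here.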

It would be much easier to have a whitelist of allowed symbols. This could include the alphabet of the language and a few punctuation marks such as periods, question marks, commas, and quotes, but nothing more. This would regularize the structure of the sentences a lot and could be helpful for many languages.
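A rough Python sketch of what such a whitelist check could look like. The `ALLOWED_SYMBOLS` set and the `is_allowed` function are hypothetical names for illustration, not an existing option or API of the extractor, and the punctuation set is an assumption.

```python
# Assumed Esperanto alphabet plus a small, assumed set of punctuation marks.
LETTERS = "abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"
PUNCTUATION = " .,?!'\"-"
ALLOWED_SYMBOLS = set(LETTERS) | set(LETTERS.upper()) | set(PUNCTUATION)

def is_allowed(sentence, allowed=ALLOWED_SYMBOLS):
    """Return True only if every character of the sentence is whitelisted."""
    return all(ch in allowed for ch in sentence)

print(is_allowed("Ĉu vi parolas Esperanton?"))  # only whitelisted characters
print(is_allowed("Mi loĝas en Москва."))        # contains Cyrillic letters
```

Because the check is a set lookup per character, the whitelist stays short no matter how many foreign scripts show up in the source text.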

AG12r commented

Yeah, I also encountered the same problem with Georgian. There were many characters from the Latin and Cyrillic alphabets, as well as Chinese characters. Georgian has its own alphabet, so we can specify its Unicode range in a regex.

Ah, interesting, you can specify an alphabet in a regex, for example \p{InGeorgian}
https://www.regular-expressions.info/unicode.html
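For illustration, a small Python sketch of the range-based idea. Python's built-in `re` module does not support `\p{...}` escapes, so this sketch spells out the Georgian Unicode block (U+10A0 to U+10FF) explicitly; that block is slightly broader than the modern alphabet. The allowed punctuation and the `is_georgian_sentence` name are assumptions for this example.

```python
import re

# Georgian Unicode block U+10A0..U+10FF, plus an assumed set of
# punctuation marks and the space character.
GEORGIAN_ONLY = re.compile(r"[\u10A0-\u10FF .,?!\-]+")

def is_georgian_sentence(sentence):
    """Return True if the whole sentence uses only the allowed characters."""
    return GEORGIAN_ONLY.fullmatch(sentence) is not None

print(is_georgian_sentence("გამარჯობა, მსოფლიო!"))  # Georgian letters and punctuation only
print(is_georgian_sentence("გამარჯობა world"))      # mixes in Latin letters
```

A regex engine with Unicode property support could use a block or script escape instead of the explicit range.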