common-voice/cv-sentence-extractor

Combining Acute Accent character refuses to be disabled

iLeonidze opened this issue · 2 comments

I configured disallowed_symbols:

disallowed_symbols = [
  'А́', 'а́', 'Е́', 'е́', 'И́', 'и́', 'О́', 'о́', 'У́', 'у́', 'Ы́', 'ы́', 'Э́',
  'э́', 'Ю́', 'ю́', 'Я́', 'я́', 'З́', 'С́', 'Ѣ', 'І', 'ң', 'ă', 'ĕ', ' ́',
]

But I am still getting sentences with bad symbols:

Бо́льшая её часть состоит из низменностей.
В зависимости от стоя́щей задачи, может быть более удобным использовать ту или иную систему.
Каждая из таких областей называется доме́ном.
Баксы́ знали его таинственную силу.

At the same time, other bad symbols are successfully filtered.

I suspect the cause is that a symbol such as ы́ is not a single character: it consists of the letter ы followed by ́, one of the Combining Diacritical Marks (see its specification). I added ́ to the disallowed symbols as well, but it had no effect and the problem remains.
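For anyone debugging the same thing, here is a minimal Python sketch (not part of cv-sentence-extractor) that lists the code points in a string and checks for the combining acute accent. The helper names `describe` and `has_combining_acute` are hypothetical, and the NFC step is only an assumption about a reasonable way to make the check robust, not something the tool is known to do.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

def describe(text):
    """Return (code point, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

def has_combining_acute(sentence):
    """True if the sentence contains U+0301 after NFC normalization.

    NFC keeps precomposed letters such as й intact, while Cyrillic vowels
    with a stress mark have no precomposed form, so the mark stays a
    separate code point and can be detected.
    """
    return COMBINING_ACUTE in unicodedata.normalize("NFC", sentence)

# The stressed letter is two code points: the base letter plus the mark.
print(describe("\u044b\u0301"))
# [('U+044B', 'CYRILLIC SMALL LETTER YERU'), ('U+0301', 'COMBINING ACUTE ACCENT')]

print(has_combining_acute("Баксы\u0301 знали его таинственную силу."))  # True
print(has_combining_acute("Каждая из таких областей."))                  # False
```

Checking for the mark itself, rather than listing every letter-plus-mark pair, would cover all stressed vowels with a single entry.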

I may have found the root cause: some editors convert the character ́ to ́, and the two are not equal. Using vim instead of the default Ubuntu text editor seems to have fixed the problem.
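The thread does not say which code point the editor substituted, so the following is only a sketch for checking it yourself. It assumes a hypothetical helper `find_acute_lookalikes` and a hand-picked, non-exhaustive set of characters that render like an acute accent.

```python
import unicodedata

# Hand-picked, non-exhaustive set of code points that render like an acute accent.
ACUTE_LOOKALIKES = {
    "\u0301",  # COMBINING ACUTE ACCENT
    "\u0341",  # COMBINING ACUTE TONE MARK
    "\u00b4",  # ACUTE ACCENT (spacing, not combining)
    "\u02ca",  # MODIFIER LETTER ACUTE ACCENT
}

def find_acute_lookalikes(text):
    """Return 'U+XXXX NAME' for each acute-like code point in text, in order."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        for ch in text
        if ch in ACUTE_LOOKALIKES
    ]

# Paste an entry from the config file here to see what the editor actually saved.
print(find_acute_lookalikes("э\u0341"))  # ['U+0341 COMBINING ACUTE TONE MARK']

# A plain string comparison treats the two marks as different characters.
print("\u0301" == "\u0341")  # False
```

If the config and the input text were both normalized to the same form (NFC, for example) before comparison, U+0341 would be mapped to U+0301 and the two would match. Whether the extractor does this is not confirmed in this thread.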

Yeah, encoding is hard :/ I'm not sure there is much we can do here. On the other hand, I don't have a deep understanding of Unicode, so there might be.