Combining Acute Accent character refuses to be disabled

Question


iLeonidze opened this issue 5 years ago · 2 comments

I configured disallowed_symbols:

disallowed_symbols = [
  'А́', 'а́', 'Е́', 'е́', 'И́', 'и́', 'О́', 'о́', 'У́', 'у́', 'Ы́', 'ы́', 'Э́',
  'э́', 'Ю́', 'ю́', 'Я́', 'я́', 'З́', 'С́', 'Ѣ', 'І', 'ң', 'ă', 'ĕ', ' ́',
]

But I am still getting sentences with bad symbols:

Бо́льшая её часть состоит из низменностей.
В зависимости от стоя́щей задачи, может быть более удобным использовать ту или иную систему.
Каждая из таких областей называется доме́ном.
Баксы́ знали его таинственную силу.

At the same time, other bad symbols are successfully filtered.

I suppose the main reason is that the symbol ы́ consists of two characters, ы and ́, where the second is the Combining Acute Accent (U+0301), one of the Combining Diacritical Marks (here is its specification). I added ́ to the disallowed symbols on its own, but this did nothing; the problem is still there.
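To check the guess above, the code points of a suspicious string can be listed one by one. The following is a minimal Python sketch, not part of the tool discussed in this issue; the helper names `describe` and `contains_combining_acute` are made up for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

def contains_combining_acute(text):
    """True if text contains U+0301, either standalone or inside a precomposed letter."""
    # NFD splits precomposed accented letters into base letter + combining mark,
    # so a single check for U+0301 covers both spellings.
    return "\u0301" in unicodedata.normalize("NFD", text)

sample = "\u044B\u0301"  # ы followed by the combining acute accent
print(len(sample))       # 2: one visible glyph, two code points
print(describe(sample))
# [('U+044B', 'CYRILLIC SMALL LETTER YERU'), ('U+0301', 'COMBINING ACUTE ACCENT')]

print(contains_combining_acute("Баксы\u0301 знали"))  # True
print(contains_combining_acute("Баксы знали"))        # False
```

If the sentences above really contain U+0301 and the entry in disallowed_symbols does not match, the next thing to inspect is which code point the configuration file actually stores.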

Answer 1 · 2020-01-03T13:29:07.000Z

Possibly found the root cause: some editors convert the character ́ to ́; the two look the same but are not equal. Using vim instead of the default Ubuntu text editor seems to have fixed the problem.
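A quick way to see whether an editor changed the character is to print the code points of what was actually saved. This is a rough Python sketch under assumptions: the thread does not say which character the editor substituted, so the look-alikes listed in `CANDIDATES` are only plausible examples, and `code_points` is a hypothetical helper.

```python
import unicodedata

# Hypothetical look-alikes of the acute accent; which one (if any) the editor
# wrote into the file is not stated in this thread.
CANDIDATES = ["\u0301", "\u0341", "\u00B4", "\u02CA"]

def code_points(text):
    """Format every character of text as a U+XXXX code point."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

for ch in CANDIDATES:
    # The last column shows whether the character equals the real combining accent.
    print(code_points(ch), unicodedata.name(ch, "<unnamed>"), ch == "\u0301")

# Paste an entry copied from the config file here to see what it really contains.
entry = "\u044B\u0301"
print(code_points(entry))
```

Only the first candidate compares equal to U+0301, which would explain why a visually identical entry in the list fails to match the sentences.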

Answer 2 · 2020-01-03T17:45:31.000Z

Yeah, encoding is hard :/ I'm not sure there is much we can do here. On the other hand, I don't have any deep understanding of Unicode, so there might be.