Combining Acute Accent character refuses to be disabled
iLeonidze opened this issue · 2 comments
iLeonidze commented
I configured disallowed_symbols:
disallowed_symbols = [
'А́', 'а́', 'Е́', 'е́', 'И́', 'и́', 'О́', 'о́', 'У́', 'у́', 'Ы́', 'ы́', 'Э́',
'э́', 'Ю́', 'ю́', 'Я́', 'я́', 'З́', 'С́', 'Ѣ', 'І', 'ң', 'ă', 'ĕ', ' ́',
]
But I am still getting sentences with bad symbols:
Бо́льшая её часть состоит из низменностей.
В зависимости от стоя́щей задачи, может быть более удобным использовать ту или иную систему.
Каждая из таких областей называется доме́ном.
Баксы́ знали его таинственную силу.
At the same time, other bad symbols are successfully filtered.
I suppose the main reason is that the symbol ы́ consists of ы and ́, which is one of the Combining Diacritical Marks (here is its specification). I added ́ to the disallowed symbols, but this did nothing; the problem is still here.
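To make the point about ы́ concrete, here is a minimal Python sketch (not part of the tool under discussion) showing that the stressed letter is two code points and that the accent can be detected by its Unicode category. The helper name `has_combining_mark` is hypothetical, and the sketch assumes the accent in the sentences is U+0301 COMBINING ACUTE ACCENT, as the issue title suggests.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT (assumed to be the accent in question)

def has_combining_mark(sentence):
    """Return True if any combining mark remains after NFC normalization.

    NFC first composes pairs that have a precomposed form, so only
    marks without a precomposed form are reported.
    """
    composed = unicodedata.normalize("NFC", sentence)
    return any(unicodedata.category(ch).startswith("M") for ch in composed)

# The stressed letter is a base letter followed by a separate combining mark.
stressed = "\u044b" + COMBINING_ACUTE
print(len(stressed))
print([f"U+{ord(ch):04X}" for ch in stressed])

# One of the reported sentences, with and without the stress mark.
with_accent = "Баксы" + COMBINING_ACUTE + " знали его таинственную силу."
without_accent = "Баксы знали его таинственную силу."
print(has_combining_mark(with_accent))
print(has_combining_mark(without_accent))
```

Because the accent is its own code point, a check on the code point or its category catches it no matter which letter it is attached to. Note that Latin letters with an acute accent may compose into a single code point under NFC and would then not be reported by this helper.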
iLeonidze commented
Possibly found the root cause: some editors convert the character ́ to ́ when saving. They look the same but are not equal. Using vim instead of the default Ubuntu text editor seems to have fixed the problem.
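One way to check what an editor actually saved is to print the code points of the configured entry. The sketch below is only an illustration: the thread does not say which character the editor substituted, so the look-alike candidates listed are assumptions, and `code_points` is a hypothetical helper.

```python
import unicodedata

def code_points(text):
    """Format each character as 'U+XXXX NAME' so look-alikes can be told apart."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

# Assumed candidates that render like an acute accent; the actual
# substituted character in this issue is unknown.
LOOKALIKES = ["\u0301", "\u00b4", "\u02ca"]

for ch in LOOKALIKES:
    print(code_points(ch))

# A config entry written with escapes is immune to editor conversion:
# here, a space followed by the combining accent.
entry = " \u0301"
print(code_points(entry))
```

If the saved entry prints anything other than U+0301, it will never match the accent in the sentences, which would explain why adding it to the list appeared to do nothing.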
MichaelKohler commented
Yeah, encoding is hard :/ I'm not sure there is much we can do here. On the other hand, I don't have any deep understanding of Unicode, so there might be.