jfilter/clean-text

Incorrect removal of accents when preceded by a capital letter

ViliamVadocz opened this issue · 3 comments

>>> import cleantext
>>> cleantext.clean("všetko")
'vsetko'
>>> cleantext.clean("Všetko")
'va!etko'
>>> cleantext.__version__
'0.4.0'

unidecode on its own does not have this issue, so the corruption must happen at an earlier step, before the text reaches unidecode.

>>> import unidecode
>>> unidecode.unidecode("všetko")
'vsetko'
>>> unidecode.unidecode("Všetko")
'Vsetko'

The issue seems to be in the fix_bad_unicode function.

>>> from cleantext import fix_bad_unicode
>>> fix_bad_unicode("všetko")
'všetko'
>>> fix_bad_unicode("Všetko")
'VÅ¡etko'
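
For reference, the bad output has the shape of classic mojibake: it is exactly what you get when the UTF-8 bytes of the input are decoded as Latin-1. The sketch below uses only the standard library and reproduces the string. It is an illustration of what the corrupted text looks like, not a confirmed account of what fix_bad_unicode does internally, and it does not explain why the capital letter triggers it.

```python
# Assumption (not confirmed in this thread): the corrupted output corresponds
# to UTF-8 bytes being re-read as Latin-1.
original = "Všetko"

# "š" is U+0161, which UTF-8 encodes as the two bytes 0xC5 0xA1.
# Read as Latin-1, those bytes become "Å" (U+00C5) and "¡" (U+00A1).
mojibake = original.encode("utf-8").decode("latin-1")
print(mojibake)  # VÅ¡etko

# Reversing the round trip recovers the original text.
restored = mojibake.encode("latin-1").decode("utf-8")
print(restored)  # Všetko
```

This would also account for the 'va!etko' result from clean: the "Å" and "¡" characters are transliterated to "A" and "!" and the result is lowercased.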

Thanks for reporting! This is fixed in the upcoming release.