Wrong UTF-8 detection

Question

cedk opened this issue 7 years ago · 2 comments

When the input contains too few non-ASCII characters, chardet detects UTF-8 as ISO-8859-1.
Here is an example:

>>> chardet.detect(u'foo é'.encode('utf-8'))
{'confidence': 0.73, 'language': '', 'encoding': 'ISO-8859-1'}

But with a few more non-ASCII characters, it detects UTF-8 correctly:

>>> chardet.detect(u'foo é foo é'.encode('utf-8'))
{'confidence': 0.7525, 'language': '', 'encoding': 'utf-8'}
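
A possible workaround, offered as a sketch rather than a fix for chardet itself: since arbitrary ISO-8859-1 text with non-ASCII bytes rarely happens to be valid UTF-8, one can try a strict UTF-8 decode first and only ask chardet when that fails. The helper name `guess_encoding` and its `fallback` parameter are hypothetical, not part of chardet; the only chardet call assumed is `chardet.detect`, as used above.

```python
def guess_encoding(data, fallback=None):
    """Guess the encoding of `data` (bytes), preferring UTF-8.

    `fallback` is a hypothetical hook: a callable taking bytes and
    returning an encoding name. If omitted, chardet.detect is used.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        data.decode('utf-8')
    except UnicodeDecodeError:
        pass
    else:
        # Valid UTF-8 (pure ASCII also lands here, as ASCII is a subset).
        return 'utf-8'

    if fallback is None:
        # Imported lazily so the UTF-8 fast path does not need chardet.
        import chardet
        fallback = lambda raw: chardet.detect(raw)['encoding']
    return fallback(data)


print(guess_encoding(u'foo é'.encode('utf-8')))
```

The trade-off is that short ISO-8859-1 input which happens to form valid UTF-8 sequences would be reported as UTF-8, so this only makes sense when UTF-8 is the more likely encoding.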

+1

Answer 1 · 2018-12-18T08:44:42.000Z