chardet/chardet

Wrong UTF-8 detection

cedk opened this issue · 2 comments

cedk commented

When the input contains too few non-ASCII characters, chardet detects UTF-8 as ISO-8859-1.
Here is an example:

>>> chardet.detect(u'foo é'.encode('utf-8'))
{'confidence': 0.73, 'language': '', 'encoding': 'ISO-8859-1'}

But with a few more non-ASCII characters, UTF-8 is detected correctly:

>>> chardet.detect(u'foo é foo é'.encode('utf-8'))
{'confidence': 0.7525, 'language': '', 'encoding': 'utf-8'}
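A possible workaround, as a minimal sketch only: the bytes `b'foo \xc3\xa9'` are valid in both UTF-8 and ISO-8859-1, so one option is to try a strict UTF-8 decode first and only ask a detector when that fails. The helper name `detect_with_utf8_first` and its `fallback` parameter are hypothetical, not part of chardet; `fallback` is assumed to be a callable such as `chardet.detect`.

```python
def detect_with_utf8_first(data, fallback=None):
    """Guess an encoding, preferring UTF-8 when the bytes decode strictly.

    Hypothetical helper, not part of chardet. `fallback` is assumed to be a
    callable like chardet.detect that takes bytes and returns a dict.
    """
    try:
        data.decode('utf-8')  # errors='strict' is the default
    except UnicodeDecodeError:
        # Not valid UTF-8: defer to the detector, if one was supplied.
        return fallback(data) if fallback is not None else None
    # Pure ASCII also decodes as UTF-8, so report it separately.
    is_ascii = all(b < 0x80 for b in bytearray(data))
    return {'encoding': 'ascii' if is_ascii else 'utf-8'}
```

Called as `detect_with_utf8_first(u'foo é'.encode('utf-8'), fallback=chardet.detect)`, this reports `utf-8` for the short string above without consulting chardet. The trade-off is that any ISO-8859-1 text that happens to be valid UTF-8 is also reported as UTF-8.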

+1