Wrong UTF-8 detection
cedk opened this issue · 2 comments
cedk commented
When the input contains too few non-ASCII characters, chardet detects UTF-8 as ISO-8859-1.
Here is an example:
>>> chardet.detect(u'foo é'.encode('utf-8'))
{'confidence': 0.73, 'language': '', 'encoding': 'ISO-8859-1'}
But with a few more non-ASCII characters, the encoding is detected correctly:
>>> chardet.detect(u'foo é foo é'.encode('utf-8'))
{'confidence': 0.7525, 'language': '', 'encoding': 'utf-8'}
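A possible workaround, as a minimal sketch and not chardet's own behaviour: try a strict UTF-8 decode first and only fall back to statistical detection when that fails. This assumes the input is more likely UTF-8 than a legacy encoding, since ISO-8859-1 text with accented characters rarely happens to be valid UTF-8. The name `detect_utf8_first` and its `fallback` parameter are hypothetical, not part of chardet.

```python
def detect_utf8_first(data, fallback=None):
    """Guess the encoding of `data` (bytes), preferring UTF-8.

    Hypothetical helper, not part of chardet. `fallback` is any callable
    taking bytes and returning an encoding name; it is only consulted
    when `data` is not valid UTF-8.
    """
    try:
        # A strict decode raises UnicodeDecodeError on any invalid sequence.
        data.decode('utf-8')
    except UnicodeDecodeError:
        return fallback(data) if fallback is not None else None
    # Pure 7-bit input is valid UTF-8 too, but report it as ASCII.
    if all(byte < 0x80 for byte in data):
        return 'ascii'
    return 'utf-8'


print(detect_utf8_first('foo é'.encode('utf-8')))
```

To keep chardet for everything that is not valid UTF-8, the fallback could be something like `lambda d: chardet.detect(d)['encoding']`.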
dest81 commented
+1
queengooborg commented
+1 -- related to translate/translate#3827