Issue with change to chardet
Overview
A script failed with the new Tabulator 1.38.1, and I narrowed the cause down to the change from cchardet to chardet. For this file, https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112, cchardet has no issues but chardet gives:
File "/.../tabulator/parsers/csv.py", line 108, in __prepare_dialect
sample.append(next(stream))
File "/usr/lib/python3.8/encodings/cp1254.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 35349: character maps to <undefined>
I saw issue #265, where someone experienced the opposite: chardet works but cchardet does not. Obviously I can set things up to use cchardet, but I'd like to understand a bit better the discrepancies you've found between chardet and cchardet.
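For anyone reading the traceback above: the error class itself can be reproduced with the standard library alone. This is only a sketch, assuming the file is really UTF-8 and was decoded as cp1254 (Windows-1254); the sample character is made up for illustration and is not taken from the ACLED file.

```python
# Hypothetical sample: "Á" is U+00C1, which UTF-8 encodes as the bytes C3 81.
data = "Á".encode("utf-8")


def try_decode(raw: bytes, encoding: str) -> str:
    """Decode raw with the given encoding, or describe the failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return f"{encoding}: {exc.reason} (byte {raw[exc.start]:#04x})"


print(try_decode(data, "utf-8"))   # decodes cleanly
print(try_decode(data, "cp1254"))  # fails: 0x81 has no mapping in cp1254
```

If that assumption holds, the traceback points at a wrong encoding guess rather than at corrupt data.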
Please preserve this line to notify @roll (lead of this repository)
Thanks, I'll investigate.
@mcarans
I've fixed the size of the sample for detection of remote sources and this now works fine:
$ tabulator 'https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112'
@roll Thanks for fixing. I just wanted to ask about the change "Limit sample size for detection if remote" - if the character that caused the issue with chardet is at the beginning of the file, will there still be a difference of behaviour between chardet and cchardet?
@mcarans
TBH it's a very confusing issue, so I'm not sure. It would be great if we could understand what went wrong and report it to chardet. Could it be a problem with the server (e.g. some weird trailing byte)?
Yes, it is indeed confusing that it works as a local file but not as a remote URL. I can only presume that the sample sent to chardet somehow differs between the local file and the remote URL.
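One way a sample could differ from the full content, purely as a guess: if the sample is cut at a fixed byte count, the cut can land in the middle of a multi-byte UTF-8 character, so the sample is no longer valid UTF-8 even though the whole file is. The sketch below uses made-up text and cut points; it is not how Tabulator is known to build its sample.

```python
# Hypothetical payload: starts with "Á", whose UTF-8 encoding is two bytes (C3 81).
raw = ("Águas Belas, " * 3).encode("utf-8")


def is_valid_utf8(sample: bytes) -> bool:
    """Return True if the sample decodes as UTF-8 without errors."""
    try:
        sample.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(raw))      # whole payload
print(is_valid_utf8(raw[:1]))  # cut after the lead byte 0xC3, mid-character
print(is_valid_utf8(raw[:2]))  # cut on a character boundary
```

A detector given the mid-character sample has less reason to pick UTF-8, which might be enough to tip it towards a single-byte code page.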
@roll, it is odd that chardet and cchardet give the same results when tested on the URL outside of Tabulator:
from urllib.request import urlopen
import chardet
import cchardet
rawdata = urlopen('https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112').read()
print(chardet.detect(rawdata))
print(cchardet.detect(rawdata))
gives:
{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}
{'encoding': 'UTF-8', 'confidence': 0.7524999976158142}
I'm not sure how Tabulator, prior to your fix, was using chardet in such a way that it behaved differently from cchardet on the URL, so I cannot produce a cut-down example to report against chardet.
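A possible way to hunt for a cut-down example is to run both detectors on prefixes of the downloaded bytes and see whether any prefix length makes them disagree. This is a rough sketch: `encodings_by_prefix`, `main` and the prefix sizes are hypothetical names and values chosen for illustration, and it assumes only that each `detect` returns a dict with an `'encoding'` key, as in the output above.

```python
from typing import Callable, Dict, Iterable, Optional


def encodings_by_prefix(
    raw: bytes,
    sizes: Iterable[int],
    detect: Callable[[bytes], dict],
) -> Dict[int, Optional[str]]:
    """Run detect on raw[:size] for each size and collect the guessed encoding."""
    return {size: detect(raw[:size])["encoding"] for size in sizes}


def main() -> None:
    # Needs network access plus the chardet and cchardet packages.
    from urllib.request import urlopen

    import cchardet
    import chardet

    url = "https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112"
    raw = urlopen(url).read()
    sizes = [1000, 10000, 100000, len(raw)]  # arbitrary prefix lengths
    print(encodings_by_prefix(raw, sizes, chardet.detect))
    print(encodings_by_prefix(raw, sizes, cchardet.detect))
```

Calling `main()` prints one dict per detector; a prefix length where the two disagree would be a candidate for a minimal report against chardet.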