frictionlessdata/tabulator-py

Issue with change to chardet

Closed this issue · 6 comments

Overview

A script failed with the new Tabulator 1.38.1 and I wondered why. I narrowed it down to the change from cchardet to chardet. For this file: https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112 cchardet has no issues but chardet gives:

  File "/.../tabulator/parsers/csv.py", line 108, in __prepare_dialect
    sample.append(next(stream))
  File "/usr/lib/python3.8/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 35349: character maps to <undefined>
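
For context, the traceback shows the stream being decoded with cp1254 (Windows Turkish), a code page in which byte 0x81 is unassigned. In UTF-8, 0x81 is a valid continuation byte, so UTF-8 data decoded as cp1254 can fail in exactly this way. A minimal sketch of that failure in isolation, assuming a made-up sample character ("Á") rather than the actual byte sequence from the ACLED file; `try_decode` is a hypothetical helper:

```python
def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical sample: "Á" encodes to b'\xc3\x81' in UTF-8.
raw = "Á".encode("utf-8")

as_utf8 = try_decode(raw, "utf-8")      # decodes cleanly
as_cp1254 = try_decode(raw, "cp1254")   # fails: 0x81 is unassigned in cp1254

print("utf-8  :", as_utf8)
print("cp1254 :", as_cp1254)
```

So the error itself is consistent with the file being UTF-8 while the detector guessed cp1254 for the sample it was given.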

I saw issue #265, where someone experienced the opposite: chardet works but cchardet does not. Obviously I can set things up to use cchardet, but I'd like to better understand the discrepancies you've found between chardet and cchardet.



roll commented

Thanks I'll investigate

roll commented

@mcarans
I've fixed the size of the sample for detection of remote sources and this now works fine:

$ tabulator 'https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112'

mcarans commented

@roll Thanks for the fix. I wanted to ask about the change "Limit sample size for detection if remote": if the character that caused the issue with chardet is at the beginning of the file, will there still be a difference in behaviour between chardet and cchardet?

roll commented

@mcarans
TBH it's a very confusing issue, so I'm not sure. It would be great if we could understand what went wrong and report this to chardet. Could it be a problem with the server (e.g. some weird ending byte)?

mcarans commented

Yes, it is indeed confusing that it works as a local file but not as a remote URL. I can only presume that the sample sent to chardet somehow differs between the local file and the remote URL.
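
One way two samples of the same data could differ, offered as an assumption rather than a diagnosis of what Tabulator did: a sample cut at a fixed byte count can end in the middle of a multi-byte UTF-8 sequence, leaving a dangling byte at the end. A minimal sketch that checks for this with the standard library's incremental decoder; `ends_mid_character` is a hypothetical helper and the sample text is made up:

```python
import codecs


def ends_mid_character(sample: bytes) -> bool:
    """Return True if a UTF-8 sample stops partway through a multi-byte character."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    # final=False tells the decoder more bytes may follow, so an incomplete
    # trailing sequence is buffered instead of raising an error.
    decoder.decode(sample, final=False)
    pending_bytes, _ = decoder.getstate()
    return len(pending_bytes) > 0


full = "café".encode("utf-8")   # b'caf\xc3\xa9'
truncated = full[:-1]           # cuts "é" in half, leaving b'caf\xc3'

print(ends_mid_character(full))
print(ends_mid_character(truncated))
```

Whether a truncated tail like this actually changes a detector's guess would need to be checked against the real data.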

mcarans commented

@roll, oddly, chardet and cchardet give the same results when tested on the URL outside of Tabulator:

from urllib.request import urlopen
import chardet
import cchardet

# Download the whole response body, then run both detectors on the same bytes.
rawdata = urlopen('https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112').read()
print(chardet.detect(rawdata))
print(cchardet.detect(rawdata))

gives:

{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}
{'encoding': 'UTF-8', 'confidence': 0.7524999976158142}

I'm not sure how Tabulator, prior to your fix, was using chardet in a way that made it behave differently from cchardet on this URL, so I cannot produce a cut-down example to report against chardet.
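
A possible way to hunt for a cut-down example, sketched under assumptions: run both detectors on prefixes of the downloaded bytes at several sizes and stop at the first size where they disagree. `first_disagreement` and `normalized_encoding` are hypothetical helpers, not part of Tabulator, chardet, or cchardet. The detectors are passed in as callables that take bytes and return a dict with an "encoding" key, which matches the output shown above. Names are lowercased before comparing because the two libraries report 'utf-8' and 'UTF-8' respectively.

```python
from typing import Callable, Iterable, Optional, Tuple


def normalized_encoding(result: dict) -> str:
    """Lowercase the detected encoding name; an empty string means no guess."""
    return (result.get("encoding") or "").lower()


def first_disagreement(
    data: bytes,
    sizes: Iterable[int],
    detect_a: Callable[[bytes], dict],
    detect_b: Callable[[bytes], dict],
) -> Optional[Tuple[int, str, str]]:
    """Return (size, encoding_a, encoding_b) for the smallest prefix size
    on which the two detectors disagree, or None if they always agree."""
    for size in sorted(sizes):
        sample = data[:size]
        enc_a = normalized_encoding(detect_a(sample))
        enc_b = normalized_encoding(detect_b(sample))
        if enc_a != enc_b:
            return size, enc_a, enc_b
    return None
```

With both libraries installed, a call such as `first_disagreement(rawdata, [1000, 5000, 10000, 50000], chardet.detect, cchardet.detect)` would report the first prefix length where the guesses diverge; the sizes here are arbitrary and do not reflect the sample size Tabulator uses.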