christoph2/pyA2L

Autodetection of a2l file encoding is not accurate

still-learnin opened this issue · 4 comments

=============================
if encoding is not None:
    warnings.warn("Don't use parameter encoding anymore -- file encoding is autodetected now.", DeprecationWarning, stacklevel = 2)

encoding = detect_encoding(self._a2lfn)

The implication is that the 'encoding=' parameter has been deprecated; however, the code does more than warn: it overrides whatever value was passed in with the autodetected one. I have an a2l file, generated by a Vector tool, which I believe is Windows-1252 (or possibly ISO-8859-1). However, the auto-detection does not work, and pya2ldb is unable to parse the file without generating an error.

Is it possible to reinstate the encoding parameter? My understanding is that, in general, it is not possible to autodetect text encoding reliably.
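
To make the request concrete, here is a minimal sketch of the behaviour I have in mind: an explicit encoding takes precedence, and detection is only a fallback. The names `guess_encoding` and `decode_text` are hypothetical and are not part of pyA2L; the detector is a trivial stand-in, not chardet.

```python
def guess_encoding(data: bytes) -> str:
    # Hypothetical stand-in for a real detector: UTF-8 if it decodes, else Latin-1.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def decode_text(data: bytes, encoding=None) -> str:
    # An explicitly supplied encoding wins; detection is only the fallback.
    chosen = encoding if encoding is not None else guess_encoding(data)
    return data.decode(chosen)


raw = "Temperatur in °C".encode("cp1252")
print(decode_text(raw, encoding="cp1252"))  # caller's choice is respected
print(decode_text(raw))                     # falls back to the guess
```

With this ordering, a wrong guess can always be corrected by the caller instead of being forced on them.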

I think this is a statistical issue:
I'm using chardet under the hood (like so many other projects) -- you are feeding characters until chardet guesses the encoding with a very high probability; but you may have one TB of finest ASCII text, and at the end a Chinese symbol...
And yes, I'll re-enable the encoding option and ISO-8859-1 is the correct choice for German umlauts.
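
The sampling problem can be illustrated without chardet at all. The sketch below uses a deliberately simplified, hypothetical detector (`guess_from_prefix`) that only inspects the first few kilobytes; it is not how chardet works internally, but it fails in the same way when the only non-ASCII byte sits at the end of the file.

```python
def guess_from_prefix(data: bytes, sample_size: int) -> str:
    # Simplified stand-in for a detector that stops reading early:
    # it only looks at the first sample_size bytes.
    sample = data[:sample_size]
    try:
        sample.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        return "latin-1"


# A long run of plain ASCII with a single non-ASCII byte at the very end.
data = b"/* plain ASCII comment */\n" * 1000 + "45 °C\n".encode("latin-1")

guess = guess_from_prefix(data, sample_size=4096)
try:
    data.decode(guess)
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(guess, decoded_ok)
```

The guess looks confident on the sample, yet decoding the whole file with it fails; only reading to the end (or being told the encoding) avoids that.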

OK, done. Hope it works.
But there are still some corner cases, like /INCLUDEs with different encodings.

P.S.:
I just started working on a FAQ document, more complex questions are also highly welcome 🤗, for a prospective HOW-TO.

Unfortunately, I cannot check it, because another change in the file is causing an issue with Python 3.9, which I am using:

AttributeError: module 'time' has no attribute 'clock'

A cursory search seems to indicate that 'clock' was removed in Python 3.8, since it had platform-dependent behaviour.
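
For reference, the Python documentation points to `time.perf_counter()` or `time.process_time()` as replacements for `time.clock()`. A minimal sketch of timing a piece of work that way (the workload here is just a placeholder, not anything from pyA2L):

```python
import time

start = time.perf_counter()           # wall-clock timer with well-defined behaviour
total = sum(range(100_000))           # placeholder workload
elapsed = time.perf_counter() - start

cpu_seconds = time.process_time()     # CPU time of the current process, if that is what is wanted

print(f"elapsed: {elapsed:.6f} s, cpu: {cpu_seconds:.6f} s")
```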

Sorry, for some reason I missed that you had removed the obsolete call. Yes, this works fine now. In my case the problem was with the degree sign, '°'. But all good now. Thanks.
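
In case it helps anyone else who hits this, a small sketch of why the degree sign trips things up: it is the single byte 0xB0 in both ISO-8859-1 and Windows-1252, and that byte on its own is not valid UTF-8. This is plain Python and does not involve pyA2L.

```python
degree = "°"

latin1_bytes = degree.encode("iso-8859-1")
cp1252_bytes = degree.encode("cp1252")
utf8_bytes = degree.encode("utf-8")

print(latin1_bytes, cp1252_bytes, utf8_bytes)

# The single-byte form cannot be read back as UTF-8.
try:
    latin1_bytes.decode("utf-8")
    utf8_readable = True
except UnicodeDecodeError:
    utf8_readable = False

# The reverse mistake does not raise, it silently produces mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")

print(utf8_readable, mojibake)
```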