Config: Force UTF-8 encoding (+ doc)

Question

TheMBadger opened this issue 4 years ago · 6 comments

ML Launchpad version: ML Launchpad, version 1.0.0
Model Type used: Python
DataSource type(s) used: n.a.
Python version: Python 3.6.10
Operating System: Windows

Description

We are string-matching against a dictionary that we have stored in a config file (YAML).
We stumbled upon the problem that the diacritics in the YAML file aren't displayed or handled correctly.

We found a solution here and tested this locally by opening a yaml file ourselves (not via launchpad): yaml/pyyaml#123

We think this can be fixed in mllaunchpad on line 84 of the Config.py file.

Answer 1 · 2020-08-26T14:36:21.000Z

Could you add an example config file so I can try to reproduce? Please also tell me which encoding the config file is saved in (it should be UTF-8).

Correction: Does not have to be utf-8. On Windows, saving the file as ISO Latin 1 should work. This might be a workaround.

Answer 2 · 2020-09-01T07:16:36.000Z

part_of_config.txt
I attached the file. I deleted any sensitive company data and had to save it as txt

Answer 3 · 2020-09-01T08:38:27.000Z

@TheMBadger Thank you for the additional input. I was now able to reproduce, and, if you're okay with that, will put this issue at the top of the prioritized issues https://github.com/schuderer/mllaunchpad/projects/2

As you have read in the issue you linked to, it is Python's default behavior to open text files in the operating system's default encoding. On a Western-locale Windows installation this is typically Windows-1252 (cp1252), which agrees with ISO-Latin-1 (ISO-8859-1) for the accented characters at issue here. This is confusing, because many Python developers (including me) assume that UTF-8 is the Python default for everything. That is true for many things, with the often surprising exception of opening text files.

Fortunately, this means two things:

You can get everything to work today by saving your config file as ISO-Latin-1 (ISO-8859-1). It should just work(TM) as a workaround until this issue is done.
The Python community is already trying to fix this problem: https://www.python.org/dev/peps/pep-0597/
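
The mismatch and the workaround described above can be reproduced without ML Launchpad. The following is a minimal sketch, not code from the project: the file name, the sample text, and the helper `write_then_read` are made up for illustration, and `cp1252` is passed explicitly to stand in for the Windows default so the result is the same on any operating system.

```python
import os
import tempfile

# Hypothetical config content containing diacritics.
text = "name: Café Zoë\n"
tmp_dir = tempfile.mkdtemp()


def write_then_read(write_encoding, read_encoding):
    """Save `text` in one encoding and read it back in another."""
    path = os.path.join(tmp_dir, "example_cfg.yml")
    with open(path, "w", encoding=write_encoding) as f:
        f.write(text)
    with open(path, encoding=read_encoding) as f:
        return f.read()


# File saved as UTF-8, read with the (simulated) Windows default: the reported problem.
garbled = write_then_read("utf-8", "cp1252")
# File saved as ISO-Latin-1, read with the same default: the workaround.
workaround = write_then_read("latin-1", "cp1252")
# UTF-8 on both sides: what an enforced encoding would give.
fixed = write_then_read("utf-8", "utf-8")

# ascii() escapes non-ASCII characters so the output is safe on any console.
print(ascii(garbled))
print(ascii(workaround))
print(ascii(fixed))
```

Only the first result differs from the original text: each accented character is decoded as two unrelated characters.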

The fix for this issue will be to enforce UTF-8 as the encoding when reading config files, and to document this fact.
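
As a rough sketch of what enforcing the encoding means in practice: passing `encoding="utf-8"` to `open()` bypasses the locale-dependent default. The helper name `read_config_text` and the sample file are hypothetical, and the actual change in ML Launchpad may look different.

```python
import os
import tempfile


def read_config_text(path):
    """Hypothetical helper: read a config file as UTF-8 regardless of the OS default."""
    # The explicit encoding argument overrides the locale-dependent default.
    with open(path, encoding="utf-8") as f:
        return f.read()


# Usage with a throwaway file that contains a diacritic.
path = os.path.join(tempfile.mkdtemp(), "example_cfg.yml")
with open(path, "w", encoding="utf-8") as f:
    f.write("city: Zürich\n")

config_text = read_config_text(path)
print(ascii(config_text))
```

The same open file object, or the returned text, can then be handed to whatever YAML parser is in use.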

Answer 4 · 2020-09-01T08:48:17.000Z

Thank you for your quick reply. I will try to implement the workaround!
I agree with putting it at the top of the priority list, provided my assumption is correct that this is a quick win (not too much work).

Answer 5 · 2020-09-03T07:42:58.000Z

The workaround was effective by the way!

Answer 6 · 2021-10-18T19:01:47.000Z

@TheMBadger Implemented in commit 9446891. Please note that from this version on, you will need to strictly use only UTF-8 encoding everywhere.
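
For config files that were re-saved as ISO-Latin-1 under the earlier workaround, this means converting them back to UTF-8. A minimal sketch of a one-off conversion follows; `convert_to_utf8` and the sample file are hypothetical and not part of ML Launchpad.

```python
import os
import tempfile


def convert_to_utf8(path, source_encoding="latin-1"):
    """Hypothetical helper: rewrite a legacy-encoded text file as UTF-8 in place."""
    with open(path, encoding=source_encoding) as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Simulate a config file saved with the ISO-Latin-1 workaround.
path = os.path.join(tempfile.mkdtemp(), "example_cfg.yml")
with open(path, "w", encoding="latin-1") as f:
    f.write("city: Zürich\n")

# A strict UTF-8 reader rejects the legacy file.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    legacy_readable = True
except UnicodeDecodeError:
    legacy_readable = False

# After conversion, the same file reads cleanly as UTF-8.
convert_to_utf8(path, source_encoding="latin-1")
with open(path, encoding="utf-8") as f:
    converted = f.read()

print(legacy_readable)
print(ascii(converted))
```

Keeping a backup of the original file before converting in place is advisable, since a wrong `source_encoding` would silently produce garbled text.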