schuderer/mllaunchpad

Config: Force UTF-8 encoding (+ doc)

TheMBadger opened this issue · 6 comments

  • ML Launchpad version: ML Launchpad, version 1.0.0
  • Model Type used: Python
  • DataSource type(s) used: n.a.
  • Python version: Python 3.6.10
  • Operating System: Windows

Description

We are string-matching against a dictionary that we store in a config file (YAML).
We stumbled upon the problem that the diacritics in the YAML file aren't displayed/handled correctly.

We found a solution here and tested this locally by opening a yaml file ourselves (not via launchpad): yaml/pyyaml#123

We think this can be fixed in mllaunchpad on line 84 of the Config.py file.

Could you add an example config file so I can try to reproduce? Please also tell me which encoding the config file is saved in (it should be utf-8).

Correction: Does not have to be utf-8. On Windows, saving the file as ISO Latin 1 should work. This might be a workaround.

part_of_config.txt
I attached the file. I removed all sensitive company data and had to save it as .txt.

@TheMBadger Thank you for the additional input. I was now able to reproduce, and, if you're okay with that, will put this issue at the top of the prioritized issues https://github.com/schuderer/mllaunchpad/projects/2

As you have read in the issue you linked to, Python's default behavior is to open text files in the operating system's default encoding, which on Windows is typically a legacy code page such as cp1252 (nearly identical to ISO-Latin-1, i.e. ISO-8859-1) rather than UTF-8. This is confusing, because many Python developers (including me) assume that UTF-8 is the Python default for everything (this is true for many things, with the often surprising exception of opening text files).

Fortunately, this means two things:

  1. You can get everything to work today by saving your config file as ISO-Latin-1 (ISO-8859-1). It should just work(TM) as a workaround until this issue is done.
  2. The Python community is already trying to fix this problem: https://www.python.org/dev/peps/pep-0597/
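
To illustrate the mismatch and why the workaround helps, here is a minimal sketch. The sample string is made up, and the explicit `latin-1`/`cp1252` decoding only stands in for what an `open()` call without an `encoding` argument would presumably do on a Western European Windows setup; it is not taken from the mllaunchpad code.

```python
import locale

# Made-up config value containing diacritics (not from the attached file).
text = "naïve café"

# Bytes on disk when the editor saves the config file as UTF-8.
utf8_bytes = text.encode("utf-8")

# Decoding UTF-8 bytes with a single-byte Western encoding splits each
# two-byte character into two wrong characters: "naÃ¯ve cafÃ©".
garbled = utf8_bytes.decode("latin-1")

# Workaround: save the file as ISO-Latin-1 instead. For these characters,
# Latin-1 and cp1252 use the same byte values, so the text survives.
latin1_bytes = text.encode("latin-1")
restored = latin1_bytes.decode("cp1252")

# The encoding Python falls back to for text files on this machine.
default_encoding = locale.getpreferredencoding(False)
```

On a machine whose `default_encoding` is already UTF-8 (most Linux and macOS setups), the problem would not show up, which may be why it is easy to miss during development.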

The fix for this issue will be to enforce UTF-8 as the default encoding when reading config files, and to document this fact.
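
A rough sketch of what such a fix could look like, assuming the config is read through a plain `open()` call. The helper names `read_config_text` and `write_sample_config` and the sample file content are hypothetical and do not reflect the actual code in Config.py; the returned text would then be handed to the YAML parser (for example `yaml.safe_load`).

```python
import os
import tempfile


def read_config_text(path):
    """Read a config file as UTF-8, independent of the OS default encoding."""
    # Passing encoding explicitly overrides the locale-dependent default.
    with open(path, encoding="utf-8") as f:
        return f.read()


def write_sample_config(directory):
    """Write a small UTF-8 sample config (hypothetical content) and return its path."""
    path = os.path.join(directory, "sample_config.yml")
    with open(path, "w", encoding="utf-8") as f:
        f.write("keywords:\n  - café\n  - naïve\n")
    return path


# Round trip: the diacritics come back intact on any platform.
with tempfile.TemporaryDirectory() as tmp:
    sample_text = read_config_text(write_sample_config(tmp))
```

With the encoding fixed this way, a config file saved in a different encoding (such as ISO-Latin-1) would no longer be read correctly, so the requirement needs to be documented.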

Thank you for your quick reply. I will try to implement the workaround!
I agree with putting it at the top of the priority list, provided my assumption is correct that this is a quick win (not too much work).

The workaround was effective by the way!

@TheMBadger Implemented in commit 9446891. Please note that from this version on, you will need to strictly use only UTF-8 encoding everywhere.