Config: Force UTF-8 encoding (+ doc)
TheMBadger opened this issue · 6 comments
- ML Launchpad version: ML Launchpad, version 1.0.0
- Model Type used: Python
- DataSource type(s) used: n.a.
- Python version: Python 3.6.10
- Operating System: Windows
Description
We are string-matching against a dictionary that we have stored in a config file (YAML).
We stumbled upon the problem that the diacritics in the YAML file aren't displayed/handled correctly.
We found a solution here and tested it locally by opening a YAML file ourselves (not via ML Launchpad): yaml/pyyaml#123
We think this can be fixed in mllaunchpad on line 84 of the Config.py file.
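The symptom described above can be reproduced without mllaunchpad or PyYAML. The following is only a minimal sketch: it assumes the config was saved as UTF-8 and is then decoded with a Western Windows code page (`cp1252` is an assumption), and the sample key and value are made up.

```python
# Hypothetical config line containing a diacritic.
original = "city: Curaçao"

# Bytes as they would be stored in a UTF-8 encoded file.
raw = original.encode("utf-8")

# Decoding with the wrong codec does not raise an error here;
# it silently turns each multi-byte character into several wrong ones.
garbled = raw.decode("cp1252")

# Decoding with the codec the file was saved in restores the text.
restored = raw.decode("utf-8")

print(garbled == original)   # False
print(restored == original)  # True
```

Because the wrong decoding usually succeeds without an exception, the problem tends to surface only later, for example when string matching against the garbled values fails.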
Could you add an example config file so I can try to reproduce? Please also give the info which encoding the cfg file is saved in (it should be utf-8).
Correction: Does not have to be utf-8. On Windows, saving the file as ISO Latin 1 should work. This might be a workaround.
part_of_config.txt
I attached the file. I deleted any sensitive company data and had to save it as txt
@TheMBadger Thank you for the additional input. I was now able to reproduce, and, if you're okay with that, will put this issue at the top of the prioritized issues https://github.com/schuderer/mllaunchpad/projects/2
As you have read in the issue you linked to, Python by default opens text files in the operating system's default encoding, which on Windows is usually a legacy code page such as Windows-1252 (closely related to ISO-Latin-1, ISO-8859-1) rather than UTF-8. This is confusing, because many Python developers (including me) assume that UTF-8 is the Python default for everything. That is true for many things, with the often surprising exception of opening text files.
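To see which encoding a given machine would use, the two defaults can be printed side by side. This is a small sketch using only the standard library; the values in the comments are typical examples, not guarantees, and newer Python versions or UTF-8 mode may behave differently.

```python
import locale
import sys

# Encoding that open() falls back to when no `encoding` argument is given
# (outside of Python's UTF-8 mode). On Western European Windows setups this
# is commonly "cp1252"; on most Linux and macOS systems it is "UTF-8".
default_file_encoding = locale.getpreferredencoding(False)

# Default used by str.encode() and bytes.decode(); "utf-8" in Python 3.
default_str_encoding = sys.getdefaultencoding()

print(default_file_encoding)
print(default_str_encoding)
```

If the two values differ, text files opened without an explicit `encoding` argument are decoded differently from what most string-handling code assumes.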
Fortunately, this means two things:
- You can get everything to work today by saving your config file as ISO-Latin-1 (ISO-8859-1). It should just work(TM) as a workaround until this issue is done.
- The Python community is already trying to fix this problem: https://www.python.org/dev/peps/pep-0597/
The fix for this issue will be to enforce UTF-8 as the encoding when reading the config file, and to document this fact.
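In general terms, enforcing the encoding means passing `encoding="utf-8"` explicitly when opening the file. The sketch below is not the actual mllaunchpad change; `read_config_text` and the file name are hypothetical, and the YAML parsing step is left out so the example runs with the standard library alone.

```python
import os
import tempfile


def read_config_text(path):
    """Return the file's contents, always decoded as UTF-8."""
    # The explicit encoding makes the result independent of the OS default.
    with open(path, encoding="utf-8") as f:
        return f.read()


# Round trip through a temporary file standing in for a config file.
sample = "greeting: héllo\n"
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "example_cfg.yml")
    with open(cfg_path, "w", encoding="utf-8") as f:
        f.write(sample)
    loaded = read_config_text(cfg_path)

print(loaded == sample)  # True
```

The returned text (or the open file object itself) could then be handed to a YAML parser such as PyYAML's `yaml.safe_load`, which accepts a string or a stream.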
Thank you for your quick reply. I will try to implement the workaround!
I agree that it will be on top of the priority list, if my assumption was correct that this is a quick win (not too much work)
The workaround was effective by the way!
@TheMBadger Implemented in commit 9446891. Please note that from this version on, you will need to strictly use only UTF-8 encoding everywhere.
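Anyone who used the ISO-Latin-1 workaround would need to re-save their config as UTF-8 after upgrading. Most editors can do this; as an alternative, here is a minimal conversion sketch. `convert_to_utf8` is a hypothetical helper, not part of mllaunchpad, and it assumes the file really is in the stated source encoding.

```python
import os
import tempfile


def convert_to_utf8(path, source_encoding="iso-8859-1"):
    """Rewrite the file at `path` in place as UTF-8."""
    # newline="" keeps the file's existing line endings untouched.
    with open(path, encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Demonstration on a temporary file saved as ISO-8859-1.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "example_cfg.yml")
    with open(cfg_path, "wb") as f:
        f.write("name: Zoë\n".encode("iso-8859-1"))

    convert_to_utf8(cfg_path)

    with open(cfg_path, "rb") as f:
        converted_bytes = f.read()

print(converted_bytes == "name: Zoë\n".encode("utf-8"))  # True
```

Running the conversion twice on the same file would garble it, so it is worth keeping a backup or checking the result before overwriting the original.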