yaml/pyyaml

yaml.load does not support encodings other than the current system encoding; could you add this?

Closed this issue · 8 comments

Hi folks!

We are trying to use PyYAML on Windows with UTF-8 YAML files. Alas, yaml.load raises an error: it does not support an encoding different from the system one (on our Windows system it is CP-1251). Could you add a way to set the encoding of the YAML file manually?

The traceback, if needed:

Traceback (most recent call last):
  File "D:/Projects/bricks2/main.py", line 45, in <module>
    main_wnd.load_components()
  File "D:\Projects\bricks2\bricks\gui\main_wnd.py", line 286, in load_components
    self.registry.load()
  File "D:\Projects\bricks_cli\bricks_cli\registry.py", line 38, in load
    self._load_config(root_node, config)
  File "D:\Projects\bricks_cli\bricks_cli\registry.py", line 44, in _load_config
    config_obj = yaml.load(open(config, 'r'))
  File "C:\Python35\lib\site-packages\yaml\__init__.py", line 73, in load
    loader = Loader(stream)
  File "C:\Python35\lib\site-packages\yaml\loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "C:\Python35\lib\site-packages\yaml\reader.py", line 85, in __init__
    self.determine_encoding()
  File "C:\Python35\lib\site-packages\yaml\reader.py", line 124, in determine_encoding
    self.update_raw()
  File "C:\Python35\lib\site-packages\yaml\reader.py", line 178, in update_raw
    data = self.stream.read(size)
  File "C:\Python35\lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 2574: character maps to <undefined>

The yaml.load() method takes an open file object. You must set the encoding when you open the file. This does not have anything to do with PyYAML. Your code contains

config_obj = yaml.load(open(config, 'r'))

I would suggest changing this to

with open(config, 'rt', encoding='utf8') as yml:
    config_obj = yaml.load(yml)

PS: I did not test this code, but it (or something close to it) should work on Python 3. If you are still on Python 2, you can import codecs and use codecs.open.
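
For anyone who wants to see the failure in isolation, here is a minimal sketch that uses only the standard library. The sample text, the temporary file, and the helper names (read_text, demo) are made up for illustration, and the Windows default is simulated by passing encoding="cp1251" explicitly, so it should behave the same on any system.

```python
import os
import tempfile

# Hypothetical sample content. "И" is encoded in UTF-8 as the bytes D0 98,
# and byte 0x98 is undefined in CP-1251 (the byte named in the traceback).
SAMPLE = "name: Иван\n"


def read_text(path, encoding):
    """Read the whole file as text using the given encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


def demo():
    # Write the sample as UTF-8 bytes to a temporary file.
    fd, path = tempfile.mkstemp(suffix=".yaml")
    with os.fdopen(fd, "wb") as f:
        f.write(SAMPLE.encode("utf-8"))
    try:
        try:
            # Roughly what open(path, 'r') does on a CP-1251 system.
            read_text(path, "cp1251")
            wrong_codec_failed = False
        except UnicodeDecodeError:
            wrong_codec_failed = True
        # With the encoding stated explicitly, the text round-trips.
        return wrong_codec_failed, read_text(path, "utf-8")
    finally:
        os.remove(path)


failed, text = demo()
```

The error is raised by the file object while decoding, before any YAML parsing happens, which is why the fix belongs in the open() call.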

I suggest closing this issue.

The 'rt' mode does not need to be given explicitly, since 'r' and 't' are the defaults.
https://docs.python.org/3/library/functions.html#open

@Felix-neko if your question is not answered by @TormodLandet then please reopen.

In case anyone finds this thread, thinking PyYaml is the problem:

Run python with the -X utf8 option. python -X utf8 .\script.py should do the trick.

It's just Windows being poopy, in my case, as I even used encoding='utf8' in my open(). Stupid Windows kept using cp1252.py, which caused a UnicodeEncodeError :/
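
If you want to check what Python is actually using before and after adding -X utf8, a small sketch like the following may help. The function name encoding_report is hypothetical; the calls themselves are standard library (sys.flags.utf8_mode needs Python 3.7+). Since the error mentioned above is a UnicodeEncodeError, my guess is that it came from printing to the console rather than from reading the file, which is why stdout is included.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python uses by default in this process."""
    return {
        # 1 when UTF-8 mode is on, e.g. via -X utf8 or PYTHONUTF8=1
        "utf8_mode": sys.flags.utf8_mode,
        # what text-mode open() falls back to when no encoding is given
        "open_default": locale.getpreferredencoding(False),
        # encoding used by print(); None if stdout has been replaced
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print(f"{key}: {value}")
```

Running it once as python script.py and once as python -X utf8 script.py should show whether the option changes anything on your machine.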

I would suggest to change this to

with open(config, 'rt', encoding='utf8') as yml:
    config_obj = yaml.load(yml)

When the YAML file contains !!python/tuple, I can't apply the UTF-8 encoding anymore.

~\anaconda3\lib\site-packages\yaml\constructor.py in construct_undefined(self, node)
    425 
    426     def construct_undefined(self, node):
--> 427         raise ConstructorError(None, None,
    428                 "could not determine a constructor for the tag %r" % node.tag,
    429                 node.start_mark)

ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/tuple'
  in "tmp.yaml", line 4, column 5

Any suggestions?
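
One possible direction, as a sketch only: the ConstructorError above is about the !!python/tuple tag, not about the encoding, so the two can be handled separately. The error usually means the loader in use does not know that tag (SafeLoader, for example, does not). The names TupleSafeLoader, construct_python_tuple, and load_config below are hypothetical, as is the sample document; the PyYAML calls (SafeLoader, add_constructor, construct_sequence, yaml.load with Loader=) are the library's own.

```python
import os
import tempfile

import yaml  # PyYAML


class TupleSafeLoader(yaml.SafeLoader):
    """SafeLoader subclass (hypothetical name) that also accepts !!python/tuple."""


def construct_python_tuple(loader, node):
    # Build the sequence the safe way, then convert it to a tuple.
    return tuple(loader.construct_sequence(node))


# Register the tag on the subclass only, leaving SafeLoader itself untouched.
TupleSafeLoader.add_constructor(
    "tag:yaml.org,2002:python/tuple", construct_python_tuple
)


def load_config(path):
    # The encoding is chosen when opening the file; the loader only handles tags.
    with open(path, encoding="utf-8") as f:
        return yaml.load(f, Loader=TupleSafeLoader)


# Made-up sample document with a non-ASCII value and a tuple.
SAMPLE = "title: Ünïcödé\nsize: !!python/tuple [3, 4]\n"

fd, path = tempfile.mkstemp(suffix=".yaml")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(SAMPLE)
try:
    config = load_config(path)
finally:
    os.remove(path)
```

Subclassing keeps the rest of the safe loader's restrictions in place. If you trust the file completely, passing a less restrictive loader to yaml.load is another option, with the usual caveats about untrusted input.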

In case anyone finds this thread, thinking PyYaml is the problem:

Run python with the -X utf8 option. python -X utf8 .\script.py should do the trick.

It's just Windows being poopy, in my case, as I even used encoding='utf8' in my open(). Stupid Windows kept using cp1252.py, which caused a UnicodeEncodeError :/

Thanks. Solved my problem of accents returning weird characters :-)

This is not the correct answer, however.
Windows uses UTF-8 if you open the file with that encoding.
The issue arises when you use a different encoding for the file (other than UTF-8). The correct question is

The correct answer is that the YAML specification itself does not support encodings like CP-1252 or CP-1251, rather than this being an issue with PyYAML.

What PyYAML could do is implement a custom check for invalid string delimiters like curly quotes, which are valid UTF-8 characters but not valid YAML string delimiters. This issue, highlighted in #800, can result in exceptions like UnicodeDecodeError when the YAML file is not opened with UTF-8 encoding on Windows. However, in certain contexts, the exception might be preferred over incorrect YAML content, which could include these erroneous curly quotes.
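
To make the idea concrete, here is a rough sketch of what such a check could look like if run on the decoded text before handing it to the parser. This is not something PyYAML provides; CURLY_QUOTES and find_curly_quotes are hypothetical names, and the sample text is made up.

```python
# Typographic quotes that editors and word processors often substitute
# for the ASCII quotes YAML expects as string delimiters.
CURLY_QUOTES = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}


def find_curly_quotes(text):
    """Return (line, column, character) for each curly quote, 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in CURLY_QUOTES:
                hits.append((line_no, col_no, ch))
    return hits


# Made-up sample: the second line uses curly quotes as delimiters.
sample = 'ok: "plain"\nbad: \u201ccurly\u201d\n'
hits = find_curly_quotes(sample)
```

A curly quote inside a properly quoted string is legitimate content, so a check like this is better treated as a warning than as an error. It also shows where the UnicodeDecodeError can come from: the right double quote is encoded in UTF-8 as E2 80 9D, and byte 0x9D is undefined in CP-1252.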