yaml.load does not support encodings different from current system encoding, cannot you add it?
Closed this issue · 8 comments
Hi folks!
We are trying to use PyYAML on Windows with UTF-8 YAML files. Alas, yaml.load raises an error: it does not support an encoding different from the system one (on our Windows machines it is CP-1251). Could you add a way to set the encoding of the YAML file manually?
The traceback, if needed:
Traceback (most recent call last):
  File "D:/Projects/bricks2/main.py", line 45, in <module>
    main_wnd.load_components()
  File "D:\Projects\bricks2\bricks\gui\main_wnd.py", line 286, in load_components
    self.registry.load()
  File "D:\Projects\bricks_cli\bricks_cli\registry.py", line 38, in load
    self._load_config(root_node, config)
  File "D:\Projects\bricks_cli\bricks_cli\registry.py", line 44, in _load_config
    config_obj = yaml.load(open(config, 'r'))
  File "C:\Python35\lib\site-packages\yaml\__init__.py", line 73, in load
    loader = Loader(stream)
  File "C:\Python35\lib\site-packages\yaml\loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "C:\Python35\lib\site-packages\yaml\reader.py", line 85, in __init__
    self.determine_encoding()
  File "C:\Python35\lib\site-packages\yaml\reader.py", line 124, in determine_encoding
    self.update_raw()
  File "C:\Python35\lib\site-packages\yaml\reader.py", line 178, in update_raw
    data = self.stream.read(size)
  File "C:\Python35\lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 2574: character maps to <undefined>
The yaml.load() function takes an open file object, so you must set the encoding when you open the file. This has nothing to do with PyYAML. Your code contains
config_obj = yaml.load(open(config, 'r'))
I would suggest changing this to
with open(config, 'rt', encoding='utf8') as yml:
    config_obj = yaml.load(yml)
PS: I did not test this code, but it (or something close to it) should work on Python 3. If you are still on Python 2, you can import codecs and use codecs.open.
I suggest closing this issue.
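For anyone who wants to see that the failure happens in open() and not in PyYAML, here is a minimal sketch using only the standard library. The file name config.yaml and the sample text are made up for illustration; the cp1251 codec is passed explicitly to imitate the Windows default from the traceback, so the result does not depend on the machine it runs on.

```python
import os
import tempfile

# "Иван" starts with the letter И (U+0418), whose UTF-8 encoding is D0 98.
# Byte 0x98 is unassigned in CP-1251, the same byte the traceback complains about.
text = "name: Иван\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.yaml")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Imitate the Windows locale default by naming the legacy codec explicitly.
    try:
        with open(path, "r", encoding="cp1251") as f:
            f.read()
        failed = False
    except UnicodeDecodeError:
        failed = True

    # Naming the real encoding makes the read succeed on any platform.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

print(failed, decoded == text)
```

No YAML parsing is involved at all here: the decode error is raised by the text file object before any parser would see the data.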
The 'rt' mode does not need to be given explicitly, since it is the default.
https://docs.python.org/3/library/functions.html#open
@Felix-neko if your question is not answered by @TormodLandet then please reopen.
In case anyone finds this thread thinking PyYAML is the problem:
Run Python with the -X utf8 option. python -X utf8 .\script.py should do the trick.
It's just Windows being poopy; in my case, I even used encoding='utf8' in my open(). Stupid Windows kept using cp1252.py, which caused a UnicodeEncodeError :/
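A UnicodeEncodeError (as opposed to a decode error) usually comes from writing or printing, not from reading the file, which may be why encoding='utf8' in open() was not enough here. A small sketch, assuming Python 3.7 or newer, to check which encodings the interpreter is actually using; the helper name describe_encodings is made up for this example:

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python is currently using for text I/O."""
    return {
        # 1 when started with `-X utf8` or PYTHONUTF8=1 (Python 3.7+).
        "utf8_mode": sys.flags.utf8_mode,
        # What open() typically falls back to when no encoding= is given.
        "open_default": locale.getpreferredencoding(False),
        # Encoding used by print(); a legacy code page here can raise
        # UnicodeEncodeError even when the file itself was read correctly.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


info = describe_encodings()
for key, value in info.items():
    print(f"{key}: {value}")
```

If stdout turns out to be a legacy code page, calling sys.stdout.reconfigure(encoding="utf-8") at startup is another option besides the -X utf8 switch.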
I would suggest to change this to
with open(config, 'rt', encoding='utf8') as yml: config_obj = yaml.load(yml)
If the YAML file contains !!python/tuple, I can't apply UTF-8 encoding anymore.
~\anaconda3\lib\site-packages\yaml\constructor.py in construct_undefined(self, node)
425
426 def construct_undefined(self, node):
--> 427 raise ConstructorError(None, None,
428 "could not determine a constructor for the tag %r" % node.tag,
429 node.start_mark)
ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/tuple'
in "tmp.yaml", line 4, column 5
Any suggestions?
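This ConstructorError looks unrelated to the encoding. It is what the safe loader raises for Python-specific tags, so the sketch below assumes the file was parsed with yaml.safe_load or Loader=yaml.SafeLoader, which the post does not show. The sample document is made up, and yaml.FullLoader needs PyYAML 5.1 or newer.

```python
import yaml

document = "point: !!python/tuple [1, 2]\n"

# The safe loader refuses Python-specific tags such as !!python/tuple.
try:
    yaml.safe_load(document)
    safe_failed = False
except yaml.constructor.ConstructorError:
    safe_failed = True

# FullLoader knows how to build tuples; use it only on files you trust.
data = yaml.load(document, Loader=yaml.FullLoader)
print(safe_failed, data)
```

The same Loader argument can be combined with open(..., encoding='utf8'), since the two settings are independent of each other.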
In case anyone finds this thread, thinking PyYaml is the problem:
Run python with the -X utf8 option. python -X utf8 .\script.py should do the trick.
It's just Windows being poopy, in my case, as I even used encoding='utf8' in my open(). Stupid Windows kept using cp1252.py, which caused a UnicodeEncodeError :/
Thanks. Solved my problem of accents returning weird characters :-)
This is not the correct answer, however.
Windows uses UTF-8 if you open the file with that encoding.
The issue arises when the file uses an encoding other than UTF-8. The correct question is whether PyYAML (specifically safe_load and load) can handle different encodings (e.g., UTF-16, UTF-32, possibly not CP1252 if it's not YAML specification-compliant) or if it only handles UTF-8.
The correct answer is that the YAML specification itself does not support encodings like CP-1252 or CP-1251, rather than this being an issue with PyYAML.
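To illustrate the difference, here is a rough sketch of how the encodings behave in practice. The sample text is made up, and the comments describe PyYAML's reader as I understand it: given bytes, it checks for a UTF-16 byte order mark and otherwise assumes UTF-8. A legacy code page has to be decoded by the caller before PyYAML sees the data.

```python
import io

import yaml

text = "title: Привет\n"

# A byte stream lets PyYAML pick the encoding itself: it looks for a UTF-16
# byte order mark and otherwise assumes UTF-8, independent of the OS locale.
from_utf8 = yaml.safe_load(io.BytesIO(text.encode("utf-8")))
from_utf16 = yaml.safe_load(io.BytesIO(text.encode("utf-16")))

# A legacy code page must be decoded by the caller first; PyYAML then
# receives ordinary text and never sees the original bytes.
from_cp1251 = yaml.safe_load(text.encode("cp1251").decode("cp1251"))

print(from_utf8 == from_utf16 == from_cp1251)
```

In real code the byte stream would come from open(path, 'rb'), and the CP-1251 case from open(path, encoding='cp1251').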
What PyYAML could do is implement a custom check for invalid string delimiters like curly quotes, which are valid UTF-8 characters but not valid YAML string delimiters. This issue, highlighted in #800, can result in exceptions like UnicodeDecodeError when the YAML file is not opened with UTF-8 encoding on Windows. However, in certain contexts, the exception might be preferred over incorrect YAML content, which could include these erroneous curly quotes.