yaml/pyyaml

Numbers in scientific notation without dot are parsed as string

coldfix opened this issue · 13 comments

Hi,

it appears that PyYAML requires both a dot in the mantissa and a sign in the exponent in order to parse numbers in scientific notation as numbers; otherwise it parses them as strings:

>>> import yaml
>>> yaml.safe_load('1e+10')
'1e+10'
>>> yaml.safe_load('1.e+10')
10000000000.0
>>> yaml.safe_load('1.0e10')
'1.0e10'

According to YAML 1.2 spec the dot is optional as in JSON:

Either 0, .inf, -.inf, .nan, or scientific notation matching the regular expression -? [1-9] ( \. [0-9]* [1-9] )? ( e [-+] [1-9] [0-9]* )?.

Furthermore, JSON makes even the sign in the exponent optional: https://www.json.org/number.gif

Since YAML 1.2 aspires to be a superset of JSON, I believe it should make the sign optional as well.

Others have run into similar problems:

https://stackoverflow.com/questions/30458977/yaml-loads-5e-6-as-string-and-not-a-number

Best, Thomas

Since YAML 1.2 aspires to be a superset of JSON, I believe it should make the sign optional as well.

The sign is optional.
This is the regex for the YAML JSON Schema:
-? ( 0 | [1-9] [0-9]* ) ( \. [0-9]* )? ( [eE] [-+]? [0-9]+ )?
And this one for Core Schema:
[-+]? ( \. [0-9]+ | [0-9]+ ( \. [0-9]* )? ) ( [eE] [-+]? [0-9]+ )?

You were quoting the canonical form.
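As a rough check, the Core Schema pattern quoted above can be tried against the inputs from the original report. This is only a sketch: the pattern is the one from this comment with the spec's layout whitespace removed, and the name CORE_FLOAT is made up for illustration. The pattern also matches plain integers such as 12, and it says nothing about how PyYAML itself resolves these values.

```python
import re

# Core Schema float pattern from the comment above, whitespace removed.
# CORE_FLOAT is a hypothetical name used only in this sketch.
CORE_FLOAT = re.compile(r"[-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?")


def matches_core_float(text):
    """Return True if the whole string matches the Core Schema pattern."""
    return CORE_FLOAT.fullmatch(text) is not None


# Inputs from the report, plus one string that is not a number at all.
for sample in ["1e+10", "1.e+10", "1.0e10", "5e6", "e10"]:
    print(sample, matches_core_float(sample))
```

Under this pattern, neither the dot nor the exponent sign is required, which is the point being made in the reply.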

Oh, my mistake. So I guess both changes are justified.

I ran into the same issue (which I consider a bug). The Stack Overflow solution provides a simple fix that I hope will find its way upstream sooner rather than later (unfortunately it was not adopted in the step from 3.12 to 5.1).

For anyone running into similar issues, the derived package ruamel.yaml resolves issues like the one stated above.

It seems like it's not just the presence or absence of a decimal point that affects things, but also whether there is an explicit sign (positive or negative) preceding the exponent.

I'm unsure if this is also addressed by PR #174 or if it could be included.

In [1]: import yaml

In [2]: yaml_str = ('''
   ...: hello:
   ...:   - 5.0E6
   ...:   - 5.0e6
   ...:   - 5.E6
   ...:   - 5.e6
   ...:   - 5E6
   ...:   - 5e6
   ...: ''')

In [3]: yaml.safe_load(yaml_str)
Out[3]: {'hello': ['5.0E6', '5.0e6', '5.E6', '5.e6', '5E6', '5e6']}

In [4]: yaml_str = ('''
   ...: hello:
   ...:   - 5.0E-6
   ...:   - 5.0e-6
   ...:   - 5.E-6
   ...:   - 5.e-6
   ...:   - 5E-6
   ...:   - 5e-6
   ...: ''')

In [5]: yaml.safe_load(yaml_str)
Out[5]: {'hello': [5e-06, 5e-06, 5e-06, 5e-06, '5E-6', '5e-6']}

In [6]: yaml_str = ('''
   ...: hello:
   ...:   - 5.0E+6
   ...:   - 5.0e+6
   ...:   - 5.E+6
   ...:   - 5.e+6
   ...:   - 5E+6
   ...:   - 5e+6
   ...: ''')

In [7]: yaml.safe_load(yaml_str)
Out[7]: {'hello': [5000000.0, 5000000.0, 5000000.0, 5000000.0, '5E+6', '5e+6']}

Reference to original discussion: microscopium/microscopium#159 (comment)

The issue is still present. I use the fix found on Stack Overflow, manually adding the new implicit resolver.
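For readers who have not seen that workaround, the idea can be sketched roughly as follows. This is not the exact Stack Overflow snippet: the regular expression is a simplified one that only adds exponent forms, and SciLoader and SCI_FLOAT are hypothetical names. Registering the resolver on a subclass should leave yaml.SafeLoader itself unchanged.

```python
import re
import yaml

# Assumption: a simplified pattern that only covers numbers with an exponent,
# with or without a dot in the mantissa or a sign in the exponent.
SCI_FLOAT = re.compile(r"^[-+]?(?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)[eE][-+]?[0-9]+$")


class SciLoader(yaml.SafeLoader):
    """Hypothetical SafeLoader subclass that also accepts 1e10-style floats."""


# Register the extra float pattern for scalars starting with these characters.
SciLoader.add_implicit_resolver(
    "tag:yaml.org,2002:float",
    SCI_FLOAT,
    list("-+0123456789."),
)

doc = "values: [1e10, 5E6, 1.0e10, '3900e128']"
loaded = yaml.load(doc, Loader=SciLoader)
print(loaded)
```

Quoted scalars such as '3900e128' stay strings with this loader, but an unquoted 3900e128 would now load as a float, so whether this is desirable depends on the data.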

I stumbled into this one today. We use a Python script to alter a Kubernetes manifest. It uses yaml.safe_load() to read in the manifest file, alters the dictionary, and dumps it back to YAML with yaml.dump(). Today that manifest had an annotation of githash: "3900e128". It loaded into the script properly, but when it was dumped back out it became githash: 3900e128, which kubectl then read in as serviceTag: 3.9e+131

The following is a trimmed down recreation of what happens

#!/usr/bin/env python
import yaml

data = yaml.safe_load("githash: '3900e128'")
print(data)
print(yaml.dump(data))

PyYAML appears to treat this value as a plain string that needs no quoting, so it drops the quotes and the output YAML ends up being
githash: 3900e128

when it should be the following, since the value was quoted in the original YAML file:
githash: "3900e128"

I can work around it using the following

#!/usr/bin/env python
import yaml


def quoted_presenter(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')

yaml.add_representer(str, quoted_presenter)

data = yaml.safe_load("githash: '3900e128'")
print(data)
print(yaml.dump(data))

which outputs
"githash": "3900e128"
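The workaround above quotes every string, including keys. A narrower variant, sketched here under the assumption that only number-like strings need protecting, quotes just the values that match the YAML 1.2 Core Schema number pattern. QuotingDumper, LOOKS_NUMERIC, and represent_str are hypothetical names, and the representer is registered on a subclass so the stock dumpers are left alone.

```python
import re
import yaml

# Assumption: strings matching the YAML 1.2 Core Schema number pattern are the
# ones a 1.2 parser (such as the one behind kubectl) could misread as numbers.
LOOKS_NUMERIC = re.compile(r"[-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?")


class QuotingDumper(yaml.SafeDumper):
    """Hypothetical dumper subclass; yaml.SafeDumper itself is not modified."""


def represent_str(dumper, data):
    # Force double quotes only for number-like strings; others keep the default style.
    style = '"' if LOOKS_NUMERIC.fullmatch(data) else None
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=style)


QuotingDumper.add_representer(str, represent_str)

data = yaml.safe_load("githash: '3900e128'")
text = yaml.dump(data, Dumper=QuotingDumper, default_flow_style=False)
print(text)
```

With this sketch the key stays unquoted and only the hash value is emitted in double quotes.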

Since pyyaml implements YAML 1.1, this behaviour is actually correct.
It would be better to implement resolvers and representers for YAML 1.2 and let users choose the version.
I had already implemented the resolvers in a proof of concept.
It still depends on @ingydotnet 's plans about the safe load issue, because my changes will need to rearrange the methods a bit and might conflict with those changes.

It still depends on @ingydotnet 's plans about the safe load issue

It appears @ingydotnet has not committed to this repository in 3 months now, and there are a number of increasingly pressing quality-of-life issues stemming from the lack of YAML 1.2 support. What is the strategy for landing YAML 1.2 support in PyYAML?

I'm not speaking for everyone, but IMO: it'll land when it's ready. There are no paid contributors or maintainers on this project, and since this is generally considered the "stable" Python YAML parser, if the API changes required to support YAML 1.2 break the world, it's going to be hell for everyone involved (even if we have "well, it's a major release, just deal" to hide behind).

@perlpunk has been doing some great work recently to get the functionality in there, and I'm still trying to find some time to really sit down and bash around with it to see what we can do to preserve backward compatibility for at least the common use cases.

@nitzmahone thanks for that context, and thanks to @perlpunk for all the hard work as well. The main reason I was asking was to see if the release planning process was stuck waiting for input from an absent contributor. Is there anything specific that external contributors can do to speed up the process (PRs that could be accepted for specific features or refactors)? Even an interim release where the 1.2 grammar is available, but off by default, would be terrific.

@kislyuk

Even an interim release where the 1.2 grammar is available, but off by default, would be terrific.

That's the current plan. I think defaulting to 1.2, if ever, needs to wait for a while. It would break a lot of stuff =)

I'm trying to figure out a syntax with which people can use the 1.2 schema, but without adding too many classes to PyYAML itself. #512 is a draft that adds a lot of classes, but I'm working on top of that branch for an alternative.

You say "1.2 grammar". Do you think of something specific in the grammar? Because I'm really only working on the schema. #512 has some details on that.

edit: fixed PR number

Superseded by #555.

Edit: Sorry, wanted to close the associated PR instead.