imbal/safeyaml

Colons in keys and string values

Opened this issue · 7 comments

Almost every docker-compose.yml in the world has multiple lines like this:

image: postgres:9.3

A lot of Ruby projects contain YAML where strings begin with colons, which is abominable but a fact of life:

:foo: bar

I'm working on a GitLab CI file at the moment which has colons in the middle of keys:

foo:bar: "value"

All of these should be fixed as follows:

image: "postgres:9.3"
":foo": "bar"
"foo:bar": "value"
tef commented
foo:bar: "value"

This is way way harder to disambguate

Actually, I mis-reported what the YAML looks like there - these are keys for sub-maps:

foo:bar:
  a: 1
  b: 2

Which ideally we'd fix like this:

"foo:bar":
  a: 1
  b: 2

Does that make it easier?

tef commented

Yes and No?

:foo:bar:baz

is this ':foo: bar:baz or :foo:bar: baz

If you can define the behaviour, I can implement it, but It's kinda regex dependent

tef commented

At some point it stops being safeyaml and starts being YAML when we can't easily disambiguate keys

tef commented

At the moment, how keys are handled, It scans for a string, in ""s or ''s, or an identifier-like pattern followed by a :, then checks for a .

For values, it tries to read a 'string without special characters', like : etc. When fix-unquoted is given, it doesn't complain.

To quote daniel, "Barewords are a clusterfuck"

{a:1:b:2} This parses under indentifier rules, and skips the after the : (iirc without one of the fix options), but if : is allowed inside keys, well, why not ',' ?

(Aside: If someone uses '0x0FFF' as a number, it won't parse, because barewords currently exclude things with leading digits or +/- as being anything but numbers. )

The change here might also impact on nested keys name: name: foo too.

On the other hand ...

So far, we do this sort of thing

  • check for - a list indent, or [, {,
  • if +-0123456789 then error out, numbers can't be keys or outside of [], or {}'s
  • if a " or ', then scan a string, and expect ': ' or ':\n'
  • scan for an identifier followed by ':'

We could change that last one:

  • split on the last ':' followed by a space or a newline (i.e rightmost)
  • if there's more than one, or less than one, error
  • if it's at end of line, expect an indented block
  • otherwise, key is before ': '

But the weird bit comes into how we handle the value:

on: true/false

If we try parsing the value, we'd stop at true.

And it impacts nesting: - name: value means a - has to do these checks too
This can also impact how things appear inside of {}'s, which sucks.

tef commented

I do feel the whole point of safeyaml is 'stop writing something so ambigious in the first place', but I do feel like a repair option might be necessary

If it makes things easier, image: postgres:9.3 is probably the most common of these by far. Repairing :foo: bar, foo:bar: <scalar>, and foo:bar:\n<indented map/list> is way less important imo.

From what I understand from the scanning logic, to repair that first case, we'd have to include : in the acceptable characters when scanning after encountering <identifier><colon><whitespace>?