yaml/libyaml

%TAG prefix does not accept all characters in ns-uri-char production

gkellogg opened this issue · 2 comments

As noted in yaml/yaml-spec#268 (comment), Psych does not accept a %TAG prefix including a #, which seems to be due to the following code:

libyaml/src/scanner.c

Lines 2603 to 2627 in f8f760f

    /*
     * The set of characters that may appear in URI is as follows:
     *
     *      '0'-'9', 'A'-'Z', 'a'-'z', '_', '-', ';', '/', '?', ':', '@', '&',
     *      '=', '+', '$', '.', '!', '~', '*', '\'', '(', ')', '%'.
     *
     * If we are inside a verbatim tag <...> (parameter uri_char is true)
     * then also the following flow indicators are allowed:
     *      ',', '[', ']'
     */
    while (IS_ALPHA(parser->buffer) || CHECK(parser->buffer, ';')
            || CHECK(parser->buffer, '/') || CHECK(parser->buffer, '?')
            || CHECK(parser->buffer, ':') || CHECK(parser->buffer, '@')
            || CHECK(parser->buffer, '&') || CHECK(parser->buffer, '=')
            || CHECK(parser->buffer, '+') || CHECK(parser->buffer, '$')
            || CHECK(parser->buffer, '.') || CHECK(parser->buffer, '%')
            || CHECK(parser->buffer, '!') || CHECK(parser->buffer, '~')
            || CHECK(parser->buffer, '*') || CHECK(parser->buffer, '\'')
            || CHECK(parser->buffer, '(') || CHECK(parser->buffer, ')')
            || (uri_char && (
                CHECK(parser->buffer, ',')
                || CHECK(parser->buffer, '[') || CHECK(parser->buffer, ']')
                )
            ))

According to the YAML 1.2 Spec, ns-uri-char does include #, which is missing from the scanner.

[39] ns-uri-char ::=
    (
      '%'
      [ns-hex-digit](https://yaml.org/spec/1.2.2/#rule-ns-hex-digit){2}
    )
  | [ns-word-char](https://yaml.org/spec/1.2.2/#rule-ns-word-char)
  | '#'
  | ';'
  | '/'
  | '?'
  | ':'
  | '@'
  | '&'
  | '='
  | '+'
  | '$'
  | ','
  | '_'
  | '.'
  | '!'
  | '~'
  | '*'
  | "'"
  | '('
  | ')'
  | '['
  | ']'
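For comparison, here is a minimal, standalone sketch of a matcher that follows production [39] literally. It is not libyaml code: `match_ns_uri_char` and `ns_uri_prefix_len` are hypothetical names, and the sketch works on a plain NUL-terminated string rather than on the parser buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libyaml.
 * Returns the number of bytes taken by one ns-uri-char at p:
 * 3 for a %XX escape, 1 for any other allowed character, 0 if none matches. */
static size_t match_ns_uri_char(const char *p)
{
    unsigned char c;

    assert(p != NULL);
    c = (unsigned char)p[0];
    if (c == '\0')
        return 0;

    /* '%' ns-hex-digit{2}; short-circuit stops before reading past the NUL */
    if (c == '%')
        return (isxdigit((unsigned char)p[1]) &&
                isxdigit((unsigned char)p[2])) ? 3 : 0;

    /* ns-word-char: ASCII digits, ASCII letters, and '-' */
    if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
        (c >= 'a' && c <= 'z') || c == '-')
        return 1;

    /* The remaining single characters of the production, including '#' */
    if (strchr("#;/?:@&=+$,_.!~*'()[]", c) != NULL)
        return 1;

    return 0;
}

/* Length of the longest prefix of s that matches ns-uri-char*. */
static size_t ns_uri_prefix_len(const char *s)
{
    size_t total = 0, n;

    assert(s != NULL);
    while ((n = match_ns_uri_char(s + total)) > 0)
        total += n;
    return total;
}
```

With this matcher, the whole of `http://www.w3.org/2001/XMLSchema#` is accepted, whereas the scanner loop quoted above stops at the `#`.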

This prevents writing a %TAG directive such as the following:

%TAG ! http://www.w3.org/2001/XMLSchema#

As a workaround, %TAG ! http://www.w3.org/2001/XMLSchema%23 works, but is not ideal, and shouldn't be required based on the grammar.

The scanning issue extends to inline tags as well. If you parse the following:

%TAG !xsd! http://www.w3.org/2001/XMLSchema%23
---
date: !xsd!date 2022-08-08

and re-serialize without the %TAG directive, you'll get the following:

date: !<http://www.w3.org/2001/XMLSchema%23date> 2022-08-08

Per the grammar, you should also be able to parse the following:

date: !<http://www.w3.org/2001/XMLSchema#date> 2022-08-08

But it fails in the same way as the %TAG directive. Here the relevant production is c-verbatim-tag, which is defined in terms of ns-uri-char+, and the # is again rejected.

Working around this requires a pre-parsing step that replaces these characters as appropriate, both before parsing and after serializing.
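A rough sketch of what that replacement step could look like for a single tag URI is below. The helper names `escape_hash` and `unescape_hash` are hypothetical, and the sketch assumes the caller has already isolated the tag prefix or verbatim tag; applying it to a whole document would also rewrite `#` in comments and scalars.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy uri into out, writing "%23" for each '#'.
 * Returns 0 on success, -1 if out (including the NUL) is too small. */
static int escape_hash(const char *uri, char *out, size_t out_size)
{
    size_t n = 0;

    assert(uri != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    for (; *uri != '\0'; uri++) {
        size_t need = (*uri == '#') ? 3 : 1;

        if (n + need >= out_size)   /* keep one byte for the terminator */
            return -1;
        if (*uri == '#')
            memcpy(out + n, "%23", 3);
        else
            out[n] = *uri;
        n += need;
    }
    out[n] = '\0';
    return 0;
}

/* Hypothetical helper: turn each "%23" back into '#', in place.
 * Note: a "%23" that was already present in the original URI is decoded too. */
static void unescape_hash(char *uri)
{
    char *dst = uri;
    const char *src = uri;

    assert(uri != NULL);
    while (*src != '\0') {
        if (src[0] == '%' && src[1] == '2' && src[2] == '3') {
            *dst++ = '#';
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

The round trip is lossy for URIs that legitimately contain `%23`, which is one more reason the workaround is not ideal.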

This was tested using Ruby Psych version 4.0.4, which wraps libyaml, and the issues seem to lie entirely within libyaml.