Trouble with quoted string that starts with a `#` character
rudedogg opened this issue · 3 comments
I think the parser is having trouble with strings starting with a `#` character. If you add the following test case, the parser will crash on it:
```zig
test "single quoted string with a pound symbol" {
    const source =
        \\- hello
        \\- '#hello'
    ;

    var yaml = try Yaml.load(testing.allocator, source);
    defer yaml.deinit();

    const arr = try yaml.parse([2][]const u8);
    try testing.expectEqual(arr.len, 2);
    try testing.expect(mem.eql(u8, arr[0], "hello"));
    try testing.expect(mem.eql(u8, arr[1], "#hello"));
}
```
I think this is valid. I've been using zig-yaml to parse color schemes (with lots of RGB color hex codes), but after updating the project I ran into this issue.
I looked into this a little more, and I think the problem is in the Tokenizer, and not specific to the `#` character prefix.

After eating a single or double quote, we set `self.string_type = .single_quoted` (or `.double_quoted`). But the following call to `Tokenizer.next()` will still be in the `.start` state, and will parse the token as a comment, anchor, etc. if the quoted string begins with one of the characters handled before the `else => { state = .literal }` case.
Here are a couple tests I've added to Tokenizer.zig to try and get something working:
```zig
test "quoted literals" {
    try testExpected(
        \\'#000000'
        \\'[000000'
        \\"&someString"
    , &[_]Token.Id{
        .single_quote,
        .literal,
        .single_quote,
        .new_line,
        .single_quote,
        .literal,
        .single_quote,
        .new_line,
        .double_quote,
        .literal,
        .double_quote,
        .eof,
    });
}
```
I was able to get my quoted strings starting with `#` to parse by changing the `'#'` prong of the `.start` case in `switch (state)` (line 136 of Tokenizer.zig):
```zig
'#' => {
    if (self.string_type != .unquoted) {
        state = .literal;
    } else {
        state = .comment;
    }
},
```
but this is just a hack for that specific case. I tried thinking of how to fix this generally, but I'm new to parsers and nothing great jumped out at me. Maybe an additional state for when we're in a quoted context? Or somehow use the exclusion list on `Parser.eatToken()` or `Parser.eatCommentsAndSpace()`?
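One general direction along the "additional quoted state" idea is to have the tokenizer consume the entire quoted string in a dedicated state, so the start-state rules for `#`, `&`, `[`, etc. never see quoted contents. A loose Python sketch of that idea (my own assumptions; not zig-yaml's actual code, and not necessarily the approach taken in #15):

```python
def tokenize(src):
    """Toy tokenizer: a dedicated quoted-string state eats everything up to
    the closing quote, so comment/anchor detection only applies unquoted."""
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c in "'\"":
            quote = c
            j = i + 1
            while j < len(src) and src[j] != quote:
                j += 1  # quoted state: no comment/anchor handling in here
            tokens.append(("quoted_literal", src[i + 1:j]))
            i = j + 1  # skip the closing quote
        elif c == "#":
            while i < len(src) and src[i] != "\n":
                i += 1  # unquoted '#' starts a comment running to end of line
        elif c == "\n":
            tokens.append(("new_line",))
            i += 1
        elif c.isspace():
            i += 1
        else:
            j = i
            while j < len(src) and not src[j].isspace():
                j += 1
            tokens.append(("literal", src[i:j]))
            i = j
    tokens.append(("eof",))
    return tokens

# The '#' never reaches the comment rule because the quoted state owns it:
assert tokenize("'#000000'") == [("quoted_literal", "#000000"), ("eof",)]
```

The design point is that quoting is decided once, at the opening quote, instead of being re-checked character by character in the start state.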
Thanks for submitting the issue, and apologies for not tending to it earlier, but I find it hard to devote enough time to all of my subprojects these days. Anyhow, I have provided a fix to this issue in #15 if you would like to have a look. Your investigation was spot on in the sense that it is currently very difficult to handle this with how we handle quoted strings in the `Tokenizer`, and so I went ahead and rewrote that bit entirely in the hope that it will be easier to handle quoted strings from now on. The modified approach is described in the PR itself.
No worries at all on the timing! I guessed when I opened the issue you would be tied up with 0.10 stuff.
Thank you for the fix and explanation in #15; I'm happy the test and exploration I did helped.