skvadrik/re2c

Zero code point not handled in utf-8 mode?

Closed this issue · 3 comments

Hi, I may be missing something, but I have code that works with "re2c:encoding:utf8 = 0" and fails with "re2c:encoding:utf8 = 1".
Below are the rules; the full code is here:

<init> "'" => str {
    std::cout << "-- got start quote" << std::endl;
    goto yyc_str;
}

<str> [^'\x00]+ {
    std::cout << "-- got some chars" << std::endl;
    goto yyc_str;
}

<str> "'" => init {
    std::cout << "-- got end quote" << std::endl;
    return makeToken(TokenType::Str);
}

<str> [^'\x00] / '\x00' => init {
    std::cout << "-- reached zero char while in a string" << std::endl;
    return makeToken(TokenType::UnterminatedStr);
}

<init> '\x00' {
    std::cout << "-- stopping" << std::endl;
    break;
}

The input is 'aa (an opening quote that is never closed). With "utf-8" off, it works just fine and I get "Type = 1", which is TokenType::UnterminatedStr.

-- got start quote
-- got some chars
-- got some chars
-- reached zero char while in a string
Type = 1, value = "'aa"
-- stopping

And with "utf-8" on, the code loops indefinitely:

-- got start quote
-- got some chars
-- got some chars
-- got some chars
-- got some chars
...

It looks like the zero code point is not handled in "[^'\x00]" in this case. Or am I doing something wrong?

Rerun the generation with the -W option (it enables re2c warnings): you'll see that re2c complains about undefined control flow:

warning: control flow in condition 'str' is undefined for strings that match 
        '[\x0\x80-\xC1\xF5-\xFF]'
        '[\xC2-\xDF] [\x0-\x7F\xC0-\xFF]'
        '\xE0 [\x0-\x9F\xC0-\xFF]'
        '[\xE1-\xEF] [\x0-\x7F\xC0-\xFF]'
        '\xF0 [\x0-\x8F\xC0-\xFF]'
        '[\xF1-\xF3] [\x0-\x7F\xC0-\xFF]'
        '\xF4 [\x0-\x7F\x90-\xFF]'
        '\xE0 [\xA0-\xBF] [\x0-\x7F\xC0-\xFF]'
        '[\xE1-\xEF] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
        '\xF0 [\x90-\xBF] [\x0-\x7F\xC0-\xFF]'
        '[\xF1-\xF3] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
        '\xF4 [\x80-\x8F] [\x0-\x7F\xC0-\xFF]'
        '\xF0 [\x90-\xBF] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
        '[\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
        '\xF4 [\x80-\x8F] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
, use default rule '*' [-Wundefined-control-flow]

This means that not all possible code paths are covered by your rules: if the input happens to match one of the above patterns, control flow in your program is undefined. What you need to do is define the default rule * in every condition. See here for details: https://re2c.org/manual/warnings/warnings.html#wundefined-control-flow
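
As a rough sketch of that advice, default rules for the two conditions in the question might look like the fragment below. It is not the original code: TokenType::Error is a hypothetical enumerator added for illustration, while makeToken and TokenType::UnterminatedStr are taken from the rules above. The default rule has the lowest priority and consumes exactly one code unit, so if it fires on the terminating zero, the calling loop should probably stop rather than ask for another token.

```re2c
<str> * => init {
    // Reached only when no other <str> rule matches, e.g. on the
    // terminating zero or on an ill-formed UTF-8 sequence.
    std::cout << "-- unexpected input while in a string" << std::endl;
    return makeToken(TokenType::UnterminatedStr);
}

<init> * {
    // TokenType::Error is hypothetical; use whatever error token fits.
    std::cout << "-- unexpected input" << std::endl;
    return makeToken(TokenType::Error);
}
```

With rules like these in place, regenerating with -W should no longer report undefined control flow for 'str', since every code unit then has a defined transition.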

Yeah, my bad. The code is actually broken in the non-utf8 mode too.
Thanks for the prompt reply!

Should this be closed?