Zero code point not handled in utf-8 mode?
Closed this issue · 3 comments
Hi, I may be missing something but I have code that works with "re2c:encoding:utf8 = 0" and fails with "re2c:encoding:utf8 = 1".
Below are the rules, the full code is here:
<init> "'" => str {
    std::cout << "-- got start quote" << std::endl;
    goto yyc_str;
}
<str> [^'\x00]+ {
    std::cout << "-- got some chars" << std::endl;
    goto yyc_str;
}
<str> "'" => init {
    std::cout << "-- got end quote" << std::endl;
    return makeToken(TokenType::Str);
}
<str> [^'\x00] / '\x00' => init {
    std::cout << "-- reached zero char while in a string" << std::endl;
    return makeToken(TokenType::UnterminatedStr);
}
<init> '\x00' {
    std::cout << "-- stopping" << std::endl;
    break;
}
The input is 'aa. With "utf-8" off, it works just fine and I get "Type = 1", which is TokenType::UnterminatedStr:
-- got start quote
-- got some chars
-- got some chars
-- reached zero char while in a string
Type = 1, value = "'aa"
-- stopping
And with "utf-8" on, the code loops indefinitely:
-- got start quote
-- got some chars
-- got some chars
-- got some chars
-- got some chars
...
It looks like the zero code point is not handled in "[^'\x00]" in this case. Or am I doing something wrong?
Rerun the generation with the -W option (it enables re2c warnings): you'll see that re2c complains about undefined control flow:
warning: control flow in condition 'str' is undefined for strings that match
'[\x0\x80-\xC1\xF5-\xFF]'
'[\xC2-\xDF] [\x0-\x7F\xC0-\xFF]'
'\xE0 [\x0-\x9F\xC0-\xFF]'
'[\xE1-\xEF] [\x0-\x7F\xC0-\xFF]'
'\xF0 [\x0-\x8F\xC0-\xFF]'
'[\xF1-\xF3] [\x0-\x7F\xC0-\xFF]'
'\xF4 [\x0-\x7F\x90-\xFF]'
'\xE0 [\xA0-\xBF] [\x0-\x7F\xC0-\xFF]'
'[\xE1-\xEF] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
'\xF0 [\x90-\xBF] [\x0-\x7F\xC0-\xFF]'
'[\xF1-\xF3] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
'\xF4 [\x80-\x8F] [\x0-\x7F\xC0-\xFF]'
'\xF0 [\x90-\xBF] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
'[\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
'\xF4 [\x80-\x8F] [\x80-\xBF] [\x0-\x7F\xC0-\xFF]'
, use default rule '*' [-Wundefined-control-flow]
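A note on reading the first pattern of the warning, '[\x0\x80-\xC1\xF5-\xFF]': judging from the ranges, it lists the single bytes that no 'str' rule can start with in UTF-8 mode, namely the zero byte (excluded by the rules themselves) plus every byte that cannot begin a UTF-8 sequence. The sketch below only restates that byte set in C++; the function name inFirstWarningPattern is made up for illustration and is not part of re2c.

```cpp
// Hypothetical helper (not part of re2c): true if `b` is one of the single
// bytes listed in the first warning pattern, [\x0\x80-\xC1\xF5-\xFF].
bool inFirstWarningPattern(unsigned char b) {
    if (b == 0x00) {
        return true;   // excluded explicitly by [^'\x00]
    }
    if (b >= 0x80 && b <= 0xC1) {
        return true;   // 0x80-0xBF are continuation bytes; 0xC0 and 0xC1
                       // could only start overlong encodings
    }
    if (b >= 0xF5) {
        return true;   // would encode values above U+10FFFF, or is unused
    }
    return false;      // ASCII, or a usable lead byte 0xC2-0xF4
}
```

The remaining patterns in the warning follow the same idea one byte further in: a usable lead byte followed by something that is not a valid continuation byte.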
The warning means that not all possible code paths are covered by your rules: if the input happens to match one of the above patterns, control flow in your program is undefined. What you need is to define the default rule * in every condition. See here for details: https://re2c.org/manual/warnings/warnings.html#wundefined-control-flow
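In the spec itself the fix is one extra rule per condition, along the lines of `<str> * { ... }`. As a rough illustration of what that buys you, here is a hand-written stand-in for the 'str' condition in which every byte has a defined outcome. It is not re2c output: scanStr and TokenType::Invalid are invented for this sketch, and it does not decode UTF-8, so any non-ASCII byte simply falls into the catch-all branch.

```cpp
#include <cstddef>

// Str and UnterminatedStr come from the issue; Invalid is a hypothetical
// token kind standing in for whatever the default rule would return.
enum class TokenType { Str, UnterminatedStr, Invalid };

// Hand-written stand-in for the 'str' condition: scans a NUL-terminated
// buffer starting at `pos`, just past the opening quote.
TokenType scanStr(const char* buf, std::size_t& pos) {
    for (;;) {
        const unsigned char c = static_cast<unsigned char>(buf[pos]);
        if (c == '\'') {        // closing quote: consume it and finish
            ++pos;
            return TokenType::Str;
        }
        if (c == 0x00) {        // sentinel: report it, do not consume it
            return TokenType::UnterminatedStr;
        }
        if (c < 0x80) {         // plain ASCII character inside the string
            ++pos;
            continue;
        }
        // Catch-all, playing the role of re2c's default rule '*': consume
        // one byte so the scanner always makes progress, then report it.
        ++pos;
        return TokenType::Invalid;
    }
}
```

The point of the catch-all is that it consumes input and returns, so an unexpected byte can neither loop forever nor fall through into unrelated code.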
Yeah, my bad. The code is actually broken in the non-utf8 mode too.
Thanks for the prompt reply!
Should this be closed?