mismatched YYBACKUP and YYRESTORE (again)
mejedi opened this issue ยท 12 comments
The generated state machine invokes YYRESTORE
without calling YYBACKUP
first (for certain inputs). It results in a subsequent crash since we are restoring with a bogus state.
Please let me know if the below assumptions are wrong
YYBACKUP
must be executed at least once beforeYYRESTORE
$
rule fires when EOF is reached AND state machine is not in a final state AND could not reach final state viaYYRESTORE
.
Here's the reproducer (I'm on re2c 3.1
):
/*!re2c
re2c:eof = 0;
re2c:yyfill:enable = 0;
[A>]* ">" {}
$ {}
*/
Please find the first 20 lines of the output below:
$ re2go repro.re --no-generation-date -i -o /dev/stdout | head -n 20
// Code generated by re2c 3.1, DO NOT EDIT.
{
var yych YYCTYPE
yych = YYPEEK
switch (yych) {
case '>':
goto yy2
case 'A':
goto yy4
default:
if (YYLESSTHAN) {
goto yy5
}
goto yy1
}
yy1:
YYRESTORE
goto yy3
yy2:
Hi Nick, thanks for reporting. There is a catch here: you should define default rule *
, otherwise control flow is undefined on some paths (and these paths happen to be the ones that do YYRESTORE without a previous YYBACKUP).
With -W re2c warns about this:
$ re2go repro.re -W -i
repro.re:1:0: warning: control flow is undefined for strings that match
'\x41'
'[\x0-\x3D\x3F-\x40\x42-\xFF]'
, use default rule '*' [-Wundefined-control-flow]
// Code generated by re2c 3.1 on Wed May 15 22:29:11 2024, DO NOT EDIT.
{
var yych YYCTYPE
yych = YYPEEK
switch (yych) {
case '>':
goto yy2
case 'A':
goto yy4
default:
if (YYLESSTHAN) {
goto yy5
}
goto yy1
}
yy1:
YYRESTORE
goto yy3
yy2:
YYSKIP
YYBACKUP
yych = YYPEEK
switch (yych) {
case '>':
goto yy2
case 'A':
goto yy4
default:
goto yy3
}
yy3:
{}
yy4:
YYSKIP
yych = YYPEEK
switch (yych) {
case '>':
goto yy2
case 'A':
goto yy4
default:
goto yy1
}
yy5:
{ $ }
}
If you add default rule *
, the warning is gone, and the generated code is correct:
$ cat fixed.re
/*!re2c
re2c:eof = 0;
re2c:yyfill:enable = 0;
[A>]* ">" {}
$ { $ }
* { * }
*/
$ re2go fixed.re -W -i
// Code generated by re2c 3.1 on Wed May 15 22:27:22 2024, DO NOT EDIT.
{
var yych YYCTYPE
yyaccept := 0
yych = YYPEEK
switch (yych) {
case '>':
goto yy3
case 'A':
goto yy5
default:
if (YYLESSTHAN) {
goto yy8
}
goto yy1
}
yy1:
YYSKIP
yy2:
{ * }
yy3:
yyaccept = 0
YYSKIP
YYBACKUP
yych = YYPEEK
switch (yych) {
case '>':
goto yy3
case 'A':
goto yy6
default:
goto yy4
}
yy4:
{}
yy5:
yyaccept = 1
YYSKIP
YYBACKUP
yych = YYPEEK
switch (yych) {
case '>':
goto yy3
case 'A':
goto yy6
default:
goto yy2
}
yy6:
YYSKIP
yych = YYPEEK
switch (yych) {
case '>':
goto yy3
case 'A':
goto yy6
default:
goto yy7
}
yy7:
YYRESTORE
if (yyaccept == 0) {
goto yy4
} else {
goto yy2
}
yy8:
{ $ }
}
Please let me know if the below assumptions are wrong
- YYBACKUP must be executed at least once before YYRESTORE
- $ rule fires when EOF is reached AND state machine is not in a final state AND could not reach final state via YYRESTORE.
The first assumption is correct. The second one is not: $
rule fires in the initial state (and takes priority over any other rule accepting empty string). It does not fire in any state where lexer reaches the end of input - those cases are covered by the default rule *
. This is because the $
rule means "the lexer has reached the end of input" not "the lexer has consumed whatever garbage happened to be there and then reached the end of input". If you look at the generated code (for both repro.re and fixed.re) you'll see that there's only one transition from the initial state to the $
rule, and it is guarded by YYLESSTHAN check. I hope this makes sense.
mismatched YYBACKUP and YYRESTORE (again)
Again - are you referring to some other bug in the past? Can you link it here?
@skvadrik thank you for the insightful response. A hint concerning -w
switch is especially useful.
I made a mistake when reducing the test case. Please find a slightly bigger sample below
/*!re2c
re2c:eof = 0;
re2c:yyfill:enable = 0;
.* ">" { good }
[^\n>]* "\n" { unterm_nl }
$ { unterm_eof }
*/
The intention is to match the closing >
delimiter. If there are multiple found before the newline, use the rightmost one. This time, -w
doesn't produce a warning.
The output is quite unwieldy, a file is attached. It looks like "A"
is processed as follows
yych <= '\n'
is false, executeelse
branch inif
statement on line 6- fall through and reach
yy1:
since the character is not>
- current character is now 0 (a sentinel)
yych <= '\n'
is true, executethen
branch inif
on line 24- both
yych <= 0
andYYLESSTHAN
are true, hencegoto yy6
YYRESTORE
Again - are you referring to some other bug in the past? Can you link it here?
The intention is to match the closing > delimiter. If there are multiple found before the newline, use the rightmost one.
The rule .* ">"
is greedy, so the generated lexer always attempts to match more input: it matches any >
as part of .*
. Normally one would explicitly exclude the terminator: [^\n>]* ">"
or, if you want to allow escaped >
, then ([^\n>] | "\\>")* ">"
. But in your case, if you want to allow unescaped >
, you need something like ([^>\x00]* [>])+
.
See this small example for demonstration:
#include <assert.h>
int lex(const char *s) {
const char *YYCURSOR = s, *YYMARKER;
/*!re2c
re2c:yyfill:enable = 0;
re2c:define:YYCTYPE = char;
([^>\x00]* [>])+ { return YYCURSOR - s; }
* { return -1; }
*/
}
int main() {
assert(lex("aaaabb") == -1);
assert(lex("aa>aabb") == 3);
assert(lex("aa>aa>bb") == 6);
assert(lex(">>>>") == 4);
return 0;
}
The same can be done with the $
rule (there's no principal difference here, the example is a bit bigger, but posting for completeness):
#include <assert.h>
#include <string.h>
int lex(const char *s) {
const char *YYCURSOR = s, *YYMARKER, *YYLIMIT = s + strlen(s);
/*!re2c
re2c:yyfill:enable = 0;
re2c:eof = 0;
re2c:define:YYCTYPE = char;
([^>\x00]* [>])+ { return YYCURSOR - s; }
* { return -1; }
$ { return -2; }
*/
}
int main() {
assert(lex("") == -2);
assert(lex("aaaabb") == -1);
assert(lex("aa>aabb") == 3);
assert(lex("aa>aa>bb") == 6);
assert(lex(">>>>") == 4);
return 0;
}
It would be better if re2c warned you about unintentional greedy iteration, but it can only do so if the iteration never stops, and here it does stop on \n
.
This time, -w doesn't produce a warning.
Are you sure? It does for me:
warning: control flow is undefined for strings that match '[\x0-\x9\xB-\x3D\x3F-\xFF]', use default rule '*' [-Wundefined-control-flow]
@skvadrik
Thank you for your help; I think I was finally able to fix my misconceptions. Apparently getting the following bits is crucial to use re2c
successfully. It would be nice if they were emphasised more prominently in the docs:
- automata must handle any possible input; failing to do so invites nasal demons (undefined behaviour);
- for this reason, always run with warnings on;
- EOF rule is matched if there was no input consumed so far;
- default rule consumes a single character and there's no way to tell how much progress was made (in terms of characters) before failing.
Playing with the sample you helpfully provided:
#include <assert.h>
#include <string.h>
int lex(const char *s) {
const char *YYCURSOR = s, *YYMARKER, *YYLIMIT = s + strlen(s);
/*!re2c
re2c:yyfill:enable = 0;
re2c:eof = 0;
re2c:define:YYCTYPE = char;
.* ">" { return YYCURSOR - s; }
* { return s - YYCURSOR; /* negated token length */ }
$ { return -1000; }
*/
}
int main() {
assert(lex("") == -1000); // <-- EOF rule
assert(lex("1") == -1); // <-- default rule
assert(lex("1\n") == -1); // <-- default rule
assert(lex("12") == -1); // <-- default rule; I wish it returned -2
assert(lex(">1") == 1);
assert(lex(">1\n") == 1);
assert(lex(">1>") == 3);
return 0;
}
This time, -w doesn't produce a warning.
Are you sure? It does for me:
My bad, there was the warning.
Sorry for the noise, and thanks for the help.
It would be nice if they were emphasised more prominently in the docs
- automata must handle any possible input; failing to do so invites nasal demons (undefined behaviour);
- for this reason, always run with warnings on;
- EOF rule is matched if there was no input consumed so far;
- default rule consumes a single character and there's no way to tell how much progress was made (in terms of characters) before failing.
There are some docs explaining these things, although they may be not easy to find. Linking here in case someone will be reading this while debugging similar issues:
- 1 and 4 is covered by https://re2c.org/manual/warnings/warnings.html#wundefined-control-flow:
A good advice is, always use default rule *: it matches any code unit regardless of encoding, consumes a single code unit no matter what and always has the lowest priority. Note that * is a built-in hack: it cannot be expressed through ordinary rules.
- 3 is covered by https://re2c.org/manual/manual_go.html#handling-the-end-of-input and https://re2c.org/manual/manual_go.html#sentinel-with-bounds-checks:
Reaching the end of input opens three possibilities: if the lexer is in the initial state it will match the end-of-input rule $, otherwise it may fallback to a previously matched rule (including default rule *) or go to a default state, causing -Wundefined-control-flow.
As for 2, it would be good to set -Wundefined-control-flow
to be an error by default, but we can't do this for C as re2c is an old tool and there's a lot of old code without default rule *
(previously [^]
was used as a replacement for *
, but it's not quite the same). Maybe for other languages it should become an error. Note that there is a -Werror
option (and -Werror-undefined-control-flow
in particular).
@skvadrik I have started to think that some of the warnings should probably be default on instead of default off. It is hard to get people to notice everything in the documentation, but error messages are prominently self documenting.
@skvadrik I have started to think that some of the warnings should probably be default on instead of default off. It is hard to get people to notice everything in the documentation, but error messages are prominently self documenting.
For all languages except C, I agree; for C we need to do some investigation first and see if php / ninja / other old users of re2c that we know of will be broken by this - if they are, we should reach out and fix them first.
Anyway, I agree something should be done about this, as it keeps causing people a lot of debugging for no good reason.
If re2c provides the ability to turn off the warning, and maybe even prints out how to do it in the default warning message, it will be easy for downstream packagers of older software that uses re2c to notice what the fix is. I agree, though, that php and ninja and other well known users should be informed if they get broken.
I'll reopen this issue as a reminder to do something about it.
If re2c provides the ability to turn off the warning
There's been -Wno-*
and -Wno-error-*
options since the addition of warnings, and re2c prints the category (e.g. [-Wundefined-control-flow]
) at the end of the warning/error message.
Then changing the default should be pretty safe.