request: provide more helpful context in parse error messages
mcandre opened this issue · 1 comments
It would give the user a lot of helpful context if parse error messages stated the offending character.
The current error message format looks frightening and vague:
error at 1:16: expected one of " ", "$(", "${", ":", "\t", [^ (' ' | '\t' | ':' | ';' | '#' | '\r' | '\n')]
But we can add a ton of useful context by simply indicating the offending character, like:
error at 1:16: got end of file, expected one of " ", "$(", "${", ":", "\t", [^ (' ' | '\t' | ':' | ';' | '#' | '\r' | '\n')]
error at 1:16: got "\n", expected one of " ", "$(", "${", ":", "\t", [^ (' ' | '\t' | ':' | ';' | '#' | '\r' | '\n')]
error at 1:16: got "3", expected one of " ", "$(", "${", ":", "\t", [^ (' ' | '\t' | ':' | ';' | '#' | '\r' | '\n')]
error at 1:16: got "#", expected one of " ", "$(", "${", ":", "\t", [^ (' ' | '\t' | ':' | ';' | '#' | '\r' | '\n')]
Etc.
That way, the user has not only the line and column number to work with, but they can immediately see the problematic character in the text, even before they go to open and examine the file.
Furthermore, we should also include the name(s) of the rule(s) the parser was attempting when it failed. This tells the user what kind of entity the parser expected there, rather than just a list of seemingly random characters.
error at 1:16: got end of file, expected macro, include, target, or comment with " ", "$(", "${", ":", "\t", [^ (' ' | '\t' | ':' | ';' | '#' | '\r' | '\n')]
error at 1:16: got "3", expected parenthetical group with ")"
Etc.
Finally, including the contents of the offending line would also be extremely helpful for rapidly troubleshooting parse errors, similar to how rustc parse error messages include line contents.
You can customize the error messages for your language by accessing the fields of the ParseError instead of using the default Display impl. For instance, you can index the input using the location to get the character found at that location. You could also use codespan-reporting or similar to display an error message with highlights and underlines on the source line, like rustc does, which conveys more information than just quoting a character.
I left out the "found" from the default message because in a character-based parser, it could only quote a single character, without any knowledge of token structure, which might be confusing or even appear nonsensical. For example, if you mistyped the keyword "class" as "clas", the error position would be on the "c" because that's where the literal starts, so you'd see: found "c", expected one of "class", "function", "type"... I assume anyone going through the trouble of implementing a lexer and passing the tokens that would be necessary to fix this is also going to be implementing their own error message display.
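As a rough illustration of indexing the input with the error location, here is a minimal sketch that uses only the standard library. The Location struct is a hypothetical stand-in for the position data a parse error carries (1-based line and column, plus a byte offset), and describe_found, source_line, and render_error are illustrative names, not part of peg. The expected entries are passed in as already-formatted strings.

```rust
/// Hypothetical stand-in for the position carried by a parse error.
/// `line` and `column` are 1-based; `offset` is a byte index into the input.
struct Location {
    line: usize,
    column: usize,
    offset: usize,
}

/// Describe what sits at `offset`: a quoted, escaped character,
/// or "end of file" when the offset is at (or past) the end of the input.
fn describe_found(input: &str, offset: usize) -> String {
    match input.get(offset..).and_then(|rest| rest.chars().next()) {
        Some(c) => format!("{:?}", c.to_string()),
        None => "end of file".to_string(),
    }
}

/// Return the text of the 1-based line `line`, without its line terminator.
fn source_line(input: &str, line: usize) -> &str {
    input.lines().nth(line.saturating_sub(1)).unwrap_or("")
}

/// Build a message with the offending character, the source line,
/// and a caret under the error column.
fn render_error(input: &str, loc: &Location, expected: &[&str]) -> String {
    let found = describe_found(input, loc.offset);
    let text = source_line(input, loc.line);
    // Assumes one display cell per column; tabs and wide characters
    // would need extra handling to keep the caret aligned.
    let caret = format!("{}^", " ".repeat(loc.column.saturating_sub(1)));
    format!(
        "error at {}:{}: got {}, expected one of {}\n{}\n{}",
        loc.line,
        loc.column,
        found,
        expected.join(", "),
        text,
        caret
    )
}
```

With a real parser, the line, column, offset, and expected entries would come from the error value it returns; the single-character caveat described above still applies to whatever describe_found reports.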
For the second part, you can do this by annotating your grammar with quiet!{} to suppress the default entries and expected!() to add your own. See https://docs.rs/peg/latest/peg/#failure-reporting
e.g.
rule whitespace() = quiet!{[' ' | '\n' | '\t']+}
rule identifier()
= quiet!{[ 'a'..='z' | 'A'..='Z']['a'..='z' | 'A'..='Z' | '0'..='9' ]*}
/ expected!("identifier")