kevinmehall/rust-peg

Error recovery when using `precedence!`

cylixlee opened this issue · 2 comments

Hi Kevin, I'm using peg to generate an arithmetic calculator that supports parenthesized expressions. The calculator takes several calculation statements, each a valid expression followed by a single semicolon ";".

I want to add simple error recovery to this calculator: just skip everything up to the expression boundary. For example:

  1. 1+2; is a valid expression statement and is parsed as Add(Number(1), Number(2)).
  2. 1+error; is not valid because of the unrecognized right-hand-side expression of +, and is parsed as Add(Number(1), Error).

In the cases above, I can just set the expression boundary to the semicolon ;, skip everything up to it, and continue parsing the next statements.

Of course, there's one case that's not very natural: error + 1; will produce an Error expression instead of Add(Error, Number(1)), but that's OK with me.

The problem appears when it comes to grouping expressions (expressions wrapped in parentheses). Naturally, I wrote code like this:

#[derive(Debug)]
enum Expression {
    Number(f64),
    Add(Box<Expression>, Box<Expression>),
    Subtract(Box<Expression>, Box<Expression>),
    Multiply(Box<Expression>, Box<Expression>),
    Divide(Box<Expression>, Box<Expression>),

    // Special
    Error,
}

peg::parser!(grammar pegparser() for str {
    use std::str::FromStr;

    pub rule statements() -> Vec<Expression>
        = _ es:expression_statement()* _ { es }

    rule expression_statement() -> Expression
        = e:expression(';') _ ";" _ { e }

    rule expression(boundary: char) -> Expression = precedence! {
        x:(@) _ "+" _ y:@ { Expression::Add(Box::new(x), Box::new(y)) }
        x:(@) _ "-" _ y:@ { Expression::Subtract(Box::new(x), Box::new(y)) }
        --
        x:(@) _ "*" _ y:@ { Expression::Multiply(Box::new(x), Box::new(y)) }
        x:(@) _ "/" _ y:@ { Expression::Divide(Box::new(x), Box::new(y)) }
        --
        n:number() { n }
        "(" _ e:expression(')') _ ")" { e }
        [^boundary]+ { Expression::Error } /* here if I change [^boundary] to [^';'], it goes ok. */
    }

    rule _ = blank()*
    rule blank()
        = [' '|'\t'|'\r'|'\n']
    rule number() -> Expression
        = s:$(['0'..='9']+ ("." ['0'..='9']+)?) {
            match f64::from_str(s) {
                Ok(number) => Expression::Number(number),
                Err(e) => {
                    eprintln!("{}", e);
                    Expression::Error
                }
            }
        }
});

fn main() {
    println!("{:?}", pegparser::statements("1 + (error) + 2;"));
}

I'm expecting it to produce Add(Add(Number(1), Error), Number(2)), or at least Add(Number(1), Error). However, it just fails and returns an Err(ParseError):

ParseError {
    location: LineCol { line: 1, column: 6, offset: 5 },
    expected: ExpectedSet {
        expected: {
            "\"(\"", 
            "[' '|'\\t'|'\\r'|'\\n']",
            "['0'..='9']",
            "[^boundary]"
        } 
    } 
}

When I change the line I marked with the comment, the result is OK: Add(Number(1), Error). That's weird, because boundary is a char and should be acceptable in patterns, yet it just doesn't work. It can't even parse expressions without parentheses, like 1 + error;.

I wonder whether my code is wrong, and whether there is a better solution.

Necessity

Since I want to use peg in a programming language parser, I can't just set ';' as the expression boundary and skip everything.
Take this pseudo-code snippet as an example:

if (1 + error) {}

I want to produce something like IfStmt { condition: AddExpr(Number(1), Error) } instead of a rough Error expression.

You're expecting [^boundary] to match any character other than the one passed in as an argument, but because of how PEG [ ] expands to a Rust pattern in an arm of a Rust match expression, that actually never matches anything.

[x] expands to a match arm with pattern x, and ^ flips the accepting and rejecting arms of the match. So [^boundary] expands to something like

match next_char {
    boundary => reject(),
    _ => accept(),
}

An identifier like boundary as a Rust pattern matches anything and captures it into a new variable, which in this case is ignored. That variable shadows the argument boundary variable, rather than comparing the character to it.
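The shadowing described above can be reproduced in plain Rust, without peg. The following is only a sketch of the shape of the expansion, not the code peg actually generates; the function name `negated_ident_pattern` and the use of `bool` in place of accept/reject are assumptions made for illustration.

```rust
/// Mimics the shape of `[^boundary]`: returns true for "accept" and false
/// for "reject". The function name is hypothetical, not part of peg.
#[allow(unused_variables, unreachable_patterns)]
fn negated_ident_pattern(next_char: char, boundary: char) -> bool {
    match next_char {
        // `boundary` here is an identifier pattern: it binds `next_char` to
        // a new local variable that shadows the parameter. It does not
        // compare against the parameter, so this arm matches every char.
        boundary => false, // reject
        // Never reached, because the arm above is irrefutable.
        _ => true, // accept
    }
}
```

Both `negated_ident_pattern('e', ';')` and `negated_ident_pattern(';', ';')` return `false`, and rustc warns that the second arm is unreachable when the `allow` attribute is removed. That matches the reported behavior: the character class rejects every input character.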

A variable with ^ isn't very useful because it leads to the rejection arm where you can't use the variable. It's most useful in cases with custom token types, where you can do [MyTokenEnum::Ident(x)] and then use the captured x in a subsequent block.
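To illustrate the custom-token case in plain Rust: the sketch below shows the kind of capturing pattern that `[MyTokenEnum::Ident(x)]` corresponds to. The `Token` enum, its variants, and the `ident_name` helper are all hypothetical and only stand in for whatever token type a real grammar would use.

```rust
/// Hypothetical token type, standing in for a lexer's output.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Number(f64),
    Semicolon,
}

/// Plain-Rust equivalent of matching a token against `Token::Ident(x)`:
/// the pattern both tests the variant and captures its payload as `x`,
/// which is then available in the accepting arm.
fn ident_name(tok: &Token) -> Option<&str> {
    match tok {
        Token::Ident(x) => Some(x.as_str()), // accept, `x` is usable here
        _ => None,                           // reject
    }
}
```

Here the captured variable is useful because it lives in the accepting arm. With `^`, the arms are swapped, so the capture would end up in the rejecting arm where nothing can use it.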

Instead of [^boundary], try [c if c != boundary], which expands like

match next_char {
    c if c != boundary => accept(),
    _ => reject(),
}
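In the grammar from the question, the error alternative would then read `[c if c != boundary]+ { Expression::Error }`. The guard version can also be checked in plain Rust. This is a minimal sketch under the same assumptions as before: `not_boundary` and `skip_to_boundary` are hypothetical helper names, and `skip_to_boundary` only imitates what a `+` repetition over that character class would consume.

```rust
/// Mimics the shape of `[c if c != boundary]`: true means "accept".
fn not_boundary(next_char: char, boundary: char) -> bool {
    match next_char {
        // `c` captures the character; the guard compares it with the
        // `boundary` parameter, which is not shadowed here.
        c if c != boundary => true,
        _ => false,
    }
}

/// Returns the longest prefix of `input` made of non-boundary characters,
/// roughly what `[c if c != boundary]+` would consume at this position.
fn skip_to_boundary(input: &str, boundary: char) -> &str {
    let end = input
        .char_indices()
        .find(|&(_, c)| !not_boundary(c, boundary))
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    &input[..end]
}
```

For instance, `skip_to_boundary("error) + 2;", ')')` yields `"error"`, stopping right before the closing parenthesis, while `skip_to_boundary("error;", ';')` stops before the semicolon.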

Ah, that's very clear to me now! Thanks kevin.