kevinmehall/rust-peg

Improve documentation and examples of how to handle strings properly

jbx1 opened this issue · 5 comments

jbx1 commented

The documentation only shows examples of parsing numbers and single characters. Almost all the tests also don't parse strings, which makes it hard to know what one needs to do, especially if one is a bit of a beginner in Rust and rust-peg.

The information about the 'input lifetime annotation is a bit elusive (not documented), and it is not clear how this affects the lifetime annotations needed for any structs receiving the parsed str or Vec of str.

It would also be great if there were some recommendations as to how strings should be parsed, and if zero-copy can be achieved in any way.

Some proper documentation with a few examples of parsing a single string or a Vec of strings (with operators such as ** and ++) would be really helpful.

The $() operator returns an &'input str slice of the input string corresponding to the text matched by the expression inside, and is zero-copy:

pub rule alphanumeric1() -> &'input str = $(['a'..='z' | 'A'..='Z' | '0'..='9']+)

though if you want to copy it into an owned String you can do so in an action:

pub rule alphanumeric2() -> String = v:$(['a'..='z' | 'A'..='Z' | '0'..='9']+) { v.to_owned() }

You can compose these into something that parses a sequence of strings:

pub rule alphanumeric_seq1() -> Vec<&'input str> = alphanumeric1() ** ","
pub rule alphanumeric_seq2() -> Vec<String> = alphanumeric2() ** ","

or inline the rule if you don't want the separate rule:

pub rule alphanumeric_seq2a() -> Vec<String> = (v:$(['a'..='z' | 'A'..='Z' | '0'..='9']+) { v.to_owned() }) ** ","

If by "string" you mean something like a quoted string literal, it gets a little more complicated, because you have to handle escape sequences rather than return a simple slice of the input:

    pub rule double_quoted_string() -> String
      = "\"" s:double_quoted_character()* "\"" { s.into_iter().collect() }

    rule double_quoted_character() -> char
      = [^ '"' | '\\' | '\r' | '\n']
      / "\\n" { '\n' }
      / "\\u{" value:$(['0'..='9' | 'a'..='f' | 'A'..='F']+) "}" {?
            u32::from_str_radix(value, 16).ok().and_then(char::from_u32).ok_or("valid unicode code point")
        }
      / expected!("valid escape sequence")
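
The {? ... } action in the \u{...} branch is plain Rust, so it can be tried on its own. Below is a minimal standalone sketch of that expression; the function name decode_unicode_escape is made up for illustration and is not part of peg. Returning Ok accepts the match with that value, and returning Err rejects it with the given message.

```rust
/// Decodes the hex digits captured inside a `\u{...}` escape.
/// Hypothetical helper name; the body is the expression from the action above.
fn decode_unicode_escape(value: &str) -> Result<char, &'static str> {
    u32::from_str_radix(value, 16) // Err on non-hex digits, empty input, or u32 overflow
        .ok() // Result<u32, _> -> Option<u32>
        .and_then(char::from_u32) // None for surrogates and values above 0x10FFFF
        .ok_or("valid unicode code point")
}

fn main() {
    for digits in ["41", "1F600", "D800", "110000", "FFFFFFFFF"] {
        println!("{:>9} -> {:?}", digits, decode_unicode_escape(digits));
    }
}
```

Surrogates such as D800 and values past 10FFFF are not valid char values, which is why the action needs the fallible form rather than a plain { ... } action.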

Hope that helps. Leaving this issue open for these examples to be integrated somewhere in the documentation.

jbx1 commented

That's great. Maybe a bit more details about the semantics of the 'input lifetime would be helpful.

The 'input lifetime is simply the lifetime of the input argument in the generated parse function. So a rule like

pub rule x() -> Vec<&'input str> = ($(['a'..='z'])) ** ","

expands into a function like

fn x<'input>(input: &'input str) -> Result<Vec<&'input str>, ParseError>

In #299 (probably for 0.9), the name will be customizable instead of hard-coded, making it seem a little less magical.
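
On the question of structs: because the returned slices borrow from the input, a struct that stores them needs a lifetime parameter of its own, and it cannot outlive the input string. The sketch below is hand-written and does not use peg. Record, OwnedRecord, and parse_record are hypothetical names, and parse_record only mimics the signature shape of a generated rule function.

```rust
/// Stores slices borrowed from the input, so it carries the input's lifetime.
#[derive(Debug, PartialEq)]
struct Record<'input> {
    name: &'input str,
    tags: Vec<&'input str>,
}

/// Owned variant: no lifetime parameter, at the cost of one allocation per string.
#[derive(Debug, PartialEq)]
struct OwnedRecord {
    name: String,
    tags: Vec<String>,
}

impl Record<'_> {
    /// Copies every borrowed slice into an owned String.
    fn to_owned_record(&self) -> OwnedRecord {
        OwnedRecord {
            name: self.name.to_owned(),
            tags: self.tags.iter().map(|t| (*t).to_owned()).collect(),
        }
    }
}

/// Hand-written stand-in for a generated rule function (assumption: input
/// looks like `name:tag,tag,...`). The output borrows from `input`.
fn parse_record<'input>(input: &'input str) -> Result<Record<'input>, &'static str> {
    let (name, rest) = input.split_once(':').ok_or("expected ':'")?;
    if name.is_empty() {
        return Err("expected a name");
    }
    // `split` yields sub-slices of `input`; no text is copied.
    let tags: Vec<&'input str> = rest.split(',').collect();
    Ok(Record { name, tags })
}

fn main() {
    let input = String::from("alice:admin,dev");
    let record = parse_record(&input).unwrap();
    println!("{:?}", record);
    let owned = record.to_owned_record();
    println!("{} has {} tags", owned.name, owned.tags.len());
}
```

The borrowed form is the zero-copy one; converting to the owned form is only needed when the result has to outlive the input buffer.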

How can we match Unicode identifiers? Is it possible to use unicode-ident in the grammar?

Yes, [ ] patterns accept a boolean if guard, like the arms of a Rust match, so you can do something like

rule identifier() -> &'input str = $([c if is_xid_start(c)] [c if is_xid_continue(c)]*)
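
To spell out what that pattern matches, here is a rough hand-written equivalent that does not use peg: one character that passes the start predicate, then as many as possible that pass the continue predicate, returned as a slice of the input. The names identifier_prefix, approx_xid_start, and approx_xid_continue are made up for this sketch, and the two predicates are std-only approximations, not the real XID properties that unicode-ident implements.

```rust
// Assumption: std-only approximations standing in for
// unicode_ident::is_xid_start and unicode_ident::is_xid_continue.
fn approx_xid_start(c: char) -> bool {
    c.is_alphabetic()
}

fn approx_xid_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Returns the longest identifier-like prefix of `input` as a borrowed slice,
/// or None if the first character fails the start predicate.
fn identifier_prefix<'input>(input: &'input str) -> Option<&'input str> {
    let mut chars = input.char_indices();
    let (_, first) = chars.next()?;
    if !approx_xid_start(first) {
        return None;
    }
    // Byte offset of the first character that fails the continue predicate,
    // or the end of the input if every remaining character passes.
    let end = chars
        .find(|&(_, c)| !approx_xid_continue(c))
        .map_or(input.len(), |(i, _)| i);
    Some(&input[..end])
}

fn main() {
    for text in ["héllo wörld", "x1_y+2", "1abc"] {
        println!("{:?} -> {:?}", text, identifier_prefix(text));
    }
}
```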