m4rw3r/chomp

run_scanner state can't depend on last token

dckc opened this issue · 6 comments

dckc commented

I'm trying to parse one utf8 character. I tried run_scanner and std::char::from_u32, but it doesn't work because when I get a whole character, the way to signal it is to return None, which throws away the state.

dckc commented

I tried this (in my client code), but I got error: trait InputBuffer is private.

/// Like `run_scanner` but without throwing away final state.
fn scan_aux<I: Copy, S: Copy, F>(i: Input<I>, s: S, mut f: F) -> SimpleResult<I, (&[I], S)>
  where F: FnMut(S, I) -> (S, bool) {
    use chomp::input::InputBuffer;
    let b         = i.buffer();
    let mut state = s;

    match b.iter().position(|&c| { let (v, cont) = f(state, c);
                                   if cont { state = v; false }
                                   else { true } }) {
        Some(n) => i.replace(&b[n..]).ret((&b[0..n], state)),
        // TODO: Should this following 1 be something else, seeing as take_while1 is potentially
        // infinite?
        None    => i.incomplete(1),
    }
}
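For illustration, the same idea can be sketched over a plain slice with no Chomp types at all. The name scan_keep_state and its return shape are hypothetical and not part of Chomp; like scan_aux above, it leaves the stopping token unconsumed and does not apply that token's state update.

```rust
/// Hypothetical std-only analogue of `scan_aux` (not a Chomp API).
/// Threads `state` through `f` for each item; `f` returns the next state and
/// `true` to continue or `false` to stop.
/// On a stop, returns (consumed prefix, remaining input, state before the stop).
fn scan_keep_state<T: Copy, S: Copy, F>(buf: &[T], init: S, mut f: F) -> Option<(&[T], &[T], S)>
where
    F: FnMut(S, T) -> (S, bool),
{
    let mut state = init;
    for (n, &item) in buf.iter().enumerate() {
        let (next, cont) = f(state, item);
        if !cont {
            // The stopping item stays in the remainder and its update is dropped.
            return Some((&buf[..n], &buf[n..], state));
        }
        state = next;
    }
    // Input ran out before `f` signalled a stop (the "incomplete" case).
    None
}

/// Example closure: accumulate decimal digits, stop at the first non-digit.
fn digit_step(acc: u32, b: u8) -> (u32, bool) {
    if b.is_ascii_digit() {
        (acc * 10 + u32::from(b - b'0'), true)
    } else {
        (acc, false)
    }
}
```

Calling scan_keep_state(b"123;rest", 0, digit_step) hands back the digits, the untouched remainder, and the accumulated number together, which is the piece run_scanner discards.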

InputBuffer is not private; it is just exposed through the primitives module instead of the input module directly. This avoids using anything private from the input module in Chomp itself (i.e. Chomp should be implemented with the same limitations as any third-party combinators/parsers).

As for parsing a single UTF-8 character: is there any specific reason why you would want to use scan for this? If you are parsing a string you should probably just treat it as a slice of bytes, and then use std::str::from_utf8 to convert it to a string (combined with your own error type to wrap both Utf8Error and chomp::Error, this will make it pretty flexible while still keeping it zero-copy).
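A rough sketch of that wrapping error type, using only std so it stands on its own. ParserError is a hypothetical stand-in for chomp::Error, and TextError and take_str are made-up names for illustration, not Chomp APIs.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical stand-in for the parser library's own error type.
#[derive(Debug, PartialEq)]
struct ParserError;

/// Hypothetical combined error: the byte-level parse failed, or the bytes
/// were taken successfully but are not valid UTF-8.
#[derive(Debug, PartialEq)]
enum TextError {
    Parse(ParserError),
    Utf8(Utf8Error),
}

impl From<ParserError> for TextError {
    fn from(e: ParserError) -> Self {
        TextError::Parse(e)
    }
}

impl From<Utf8Error> for TextError {
    fn from(e: Utf8Error) -> Self {
        TextError::Utf8(e)
    }
}

/// Takes `len` bytes from the front of `input` and borrows them as `&str`.
/// Nothing is copied: the returned `&str` points into `input`.
fn take_str(input: &[u8], len: usize) -> Result<(&str, &[u8]), TextError> {
    if input.len() < len {
        return Err(ParserError.into());
    }
    let (head, rest) = input.split_at(len);
    // `?` converts the Utf8Error through the From impl above.
    let s = str::from_utf8(head)?;
    Ok((s, rest))
}
```

With both From impls in place, `?` works on either kind of failure, so callers only ever see one error type.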

If you are looking for parsing a single character into a char, then making a specific parser for that would be suitable. Such a parser would be useful as a part of the Chomp library itself even.

dckc commented

I am parsing a single character into a char.

This is what I managed to get working (before I saw your clue about the primitives module), though the repeated parsing is far from ideal. I'm not sure 4 is the longest representation of a char in utf8, either:

fn utf8_char(i: Input<u8>) -> U8Result<char> {
    fn validate_char<'a>(i: Input<'a, u8>, bs: &'a [u8]) -> U8Result<'a, char> {
        let ss = if bs.len() > 0 { str::from_utf8(bs).ok() } else { None };
        if let Some(s) = ss {
            let ch = s.chars().next().unwrap();
            i.ret(ch)
        } else {
            i.err(chomp::parsers::Error::new())
        }
    }

    or(i, |i| take(i, 1).bind(validate_char),
       |i| or(i, |i| take(i, 2).bind(validate_char),
          |i| or(i, |i| take(i, 3).bind(validate_char),
             |i| take(i, 4).bind(validate_char))))
}
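On the length question: 4 bytes is the longest encoding of a char in UTF-8 (char::MAX.len_utf8() is 4), and the length can be read off the first byte, which avoids the repeated attempts. A minimal std-only sketch, with hypothetical names utf8_len and decode_char; it works on a plain slice rather than Chomp's Input, and it collapses "need more input" and "invalid" into None for brevity:

```rust
use std::str;

/// Expected length of a UTF-8 sequence, judged from its first byte alone.
fn utf8_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        // Continuation bytes (0x80..=0xBF) and bytes UTF-8 never uses as a lead.
        _ => None,
    }
}

/// Decodes one char from the front of `input`, returning it with the rest.
fn decode_char(input: &[u8]) -> Option<(char, &[u8])> {
    let len = utf8_len(*input.first()?)?;
    if input.len() < len {
        return None; // truncated sequence
    }
    let (head, rest) = input.split_at(len);
    // from_utf8 still does the full validation (continuation bytes,
    // overlong forms, surrogates); the lead byte only picks the length.
    let ch = str::from_utf8(head).ok()?.chars().next()?;
    Some((ch, rest))
}
```

In a Chomp parser the same lead-byte check could decide how many bytes to take before validating once, though that wiring is not shown here.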

FWIW, the string parser is more straightforward, though if there's a more concise/idiomatic way to do this sort of validation, I'd be interested to know:

fn utf8_str(i: Input<u8>) -> U8Result<String> {
    fn check<'a>(i: Input<'a, u8>, bs: &'a [u8]) -> U8Result<'a, String> {
        match String::from_utf8(bs.to_owned()) {
            Ok(s) => i.ret(s),
            Err(_) => i.err(chomp::parsers::Error::new())  // TODO: nicer error?
        }
    };

    parse!{i;
           let len = var_int(); // TODO: check for int overflow
           let s = i -> take(i, len as usize).bind(check);
           ret s
    }
}

It looks pretty close to what I would do currently. At a later point there will probably be more tools for dealing with UTF-8 in Chomp.

One thing though: since you do not have anything in particular to parse from the string itself (e.g. escape sequences), you can use std::str::from_utf8 instead of String::from_utf8 to prevent an allocation. Of course this will tie the lifetime to the input slice, but that might not be an issue depending on usage.
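To make the trade-off concrete, here is a small std-only comparison. The function names borrowed and owned are made up for this example.

```rust
use std::str;

/// Borrowing variant: validates in place. The returned `&str` points into
/// `bytes`, so it cannot outlive the input buffer.
fn borrowed(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Owning variant: copies the bytes into a new allocation first, so the
/// resulting `String` is independent of the input's lifetime.
fn owned(bytes: &[u8]) -> Option<String> {
    String::from_utf8(bytes.to_owned()).ok()
}
```

The borrowed result shares its pointer with the input buffer, while the owned one lives in its own allocation; which is preferable depends on whether the parsed value has to outlive the buffer.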

@dckc Is this solved?

dckc commented

Yes, I expect the clue about the primitives module solves this. I haven't tested it, though.