m4rw3r/chomp

Non `Copy` `Token`s?

m4rw3r opened this issue · 2 comments

As the title says, any way we can remove the Copy requirement on the Token associated type?

The main drawback of removing it is that most filter-functions will need to operate on borrows (satisfy, take_while and so on), which will introduce some extra typing (one or two extra & or * per filter).

The benefit of removing Copy would be that we can easily do multi-stage parsing (commonly lexer + parser in separate parts) solely in Chomp, relying on the previous parser to construct a new token stream, and this could then refer to the initially parsed data all the way through (depending of course on lifetime requirements of the source itself).

This would be a breaking change, probably 1.0.0 and/or 0.4.0.

Notes:

  • Removing Copy requirement requires either a Clone or peek to return a reference.

    The latter prevents us from using &str as an input since the character iterator provides us with char which we cannot return as a reference since we have nothing to borrow from.

    The former is also not a fantastic idea, since this would cause clone to be called for anything using peek (which includes satisfy, satisfy_with, token, not_token, peek, peek_next, eof). Plus of course the Clone requirement itself.

  • ascii module predicates and any predicates operating on actual tokens (very common when working with small token-types, like u8 and char) does not work as a slot-in for any predicate, they must operate on references or the value has to be dereferenced before calling the predicate:

    take_while(i, |&c| is_whitespace(c))
    satisfy(i, |c| *c == b'a')
    // vs
    take_while(i, is_whitespace)
    satisfy(i, |c| c == b'a')

If we do not remove the Copy requirement, multi-stage parsers have to use the requirement B: Copy + Buffer to work properly which imposes restrictions on the underlying Input. Also possible is to perform the conversion of a buffer into the actual type in the lexer, but this might be a bit too early in the parsing-chain for some formats.¨

Getting some feedback on such a fundamental change as this would be great.

More:

  • Lifetimes are an issue too here, since borrowing a token prevents the following parsers from parsing the input until the reference has been dropped. Eg. if Token isn't Copy things like this would be forbidden:

    peek_next(i).then(|i, c| if c == Foo { any(i) } else { i.ret(Bar) } )

    Which is a common pattern internally (even if it is just inlined). This means that any peek needs to return by value, necessitating at least Clone.

  • Adding the Copy bound on a returned buffer prevents pretty much all container-types from being used in tokens in a multi-stage parser. Referring to slices of the input is feasible, but that satisfies Buffer + Copy while restricting the resulting parser to only work on slices.