m4rw3r/chomp

Make `Input` a trait

m4rw3r opened this issue · 13 comments

Problem

Currently the input type only allows for slices, and is special cased for situations where it may not be the whole of the input. I cannot provide any line/row/offset counting either since it is a concrete type and an extension with that functionality would impact all code.

This would provide a way to slot in position-aware wrappers to solve #38 neatly.

Proposed solution

Convert Input<I> into a trait, with ret and err as provided methods, the input-token type would be the associated type Token. All the primitive methods (currently provided by InputClone and InputBuffer) are also present but require an instance of the zero-sized type Guard which cannot be instantiated outside of the primitives module (note the private field). The primitives would be reachable through methods on a Primitives trait which has to be used separately (the blanket implementation for all Input makes it possible to easily use it once it is in scope).

use primitives::Guard;
pub use primitives::Primitives;

pub trait Input: Sized {
    type Token;
    type Marker;

    fn ret<T>(self, t: T) -> ParseResult<Self, T> {
        ParseResult(self, t)
    }

    fn _consume(self, usize, Guard)        -> Self;
    fn _buffer(&self, Guard)               -> &[Self::Token];
    fn _is_end(&self, Guard)               -> bool;
    fn _mark(&self, Guard)                 -> Self::Marker;
    fn _restore(self, Self::Marker, Guard) -> Self;
}

pub mod primitives {
    use Input;

    pub struct Guard(());

    pub trait Primitives: Input {
        fn consume(self, n: usize) -> Self {
            self._consume(Guard(()), n)
        }
        fn buffer(&self) -> &[Self::Token] {
            self._buffer(Guard(()))
        }
        fn is_end(&self) -> bool {
            self._is_end(Guard(()))
        }
        fn mark(&self) -> Self::Marker {
            self._mark(Guard(()))
        }
        fn restore(self, m: Self::Marker) -> Self {
            self._restore(Guard(()), m)
        }
    }

    impl<I: Input> Primitives for I {}
}

The mark method is the replacement for InputClone, it should be used with the restore method to restore the state of the Input to the old one.

Pros

  • Input can be implemented directly for slices, eliminating certain branches from parsers and combinators like many, take_while, eof and so on.
  • An Input implementation can be provided for line-counting which could be slotted in to provide line-counting in any existing parsers
  • The mark and restore methods would provide mechanisms allowing types which do not wholly consist of slices to work, though the buffer method is probably not the right choice for that, it will need a change to eg. support ropes.
  • All parsers need to be generic, before we could get away with only concrete types since Input<u8> is a concrete type. Input<Token=u8> will not be a concrete type.

Cons

  • Parser function signature change, very backwards incompatible:

    // old
    fn my_parser<'a, I>(i: Input<'a, I>, ...) -> ParseResult<'a, I, T, E>
    // old, lifetime elision:
    fn my_parser<I>(i: Input<I>, ...) -> ParseResult<I, T, E>
    // new
    fn my_parser<I: Input>(i: I, ...) -> ParseResult<I, T, E>
  • The type I: Input can no longer be guaranteed to be linear since the #[must_use] annotation cannot be put on the concrete type.

    This is probably not an issue in practice since the I type is required by value to create a ParseResult and the ParseResult in turn is ultimately required by the functions which start the parsing.

Switched to extension trait with blanket implementation for calling the primitives, much neater.

The above trait does not allow for zero-copy parsing as it looks right now, so far there are two choices for the buffer return type, to make it possible to perform zero-copy parsing:

Lifetime in the trait

This would make it possible to define different references for the lifetime of the Input::buffer() return:

pub trait Input<'a>: Sized {
    // ...
   fn _buffer(&self) -> &'a [Self::Token];
}

impl<'a, I> Input<'a> for &'a [I] {
    type Token = I;
    // ...
    fn _buffer(&self) -> &'a [Self::Token] { &self }
}

Pros

  • Makes it work compared to the original
  • Uses plain slices for all buffers

Cons

  • Function signature:

    The lifetime annotation cannot be elided in this case, it MUST always be present.

    fn my_parsera<'a, I: Input<'a>>(i: I) -> ParseResult<'a, I, MyType<'a>> { ... }
  • Uses plain slices for all buffers, cannot slot in any other way to store information

Associated type

Yet another associated type, Input::Buffer, which would provide a way to define different lifetimes for the buffer. This could also provide different buffer-types, eg. tendril-based or reference-counted buffers or uniquely owned buffers:

pub trait Input: Sized {
    type Buffer: Buffer<Token=Self::Token>;  // Buffer trait needed to make it possible to parse data
    // ...
    fn _buffer(&self) -> Self::Buffer;
}

impl<'a, I> Input for &'a [I] {
    type Buffer = &'a [I];
}

Pros

  • Different input sources can have different buffer types, eg. owned buffers for an iterator source or references to a common pool of tendrils if the source is a tendril.

Cons

  • Separate trait implemented for all buffers, slice is not guaranteed.

    Structs need to be generic over the buffer type to provide zero-copy parsing:

    struct MyType<B: chomp::Buffer> {
        name: B,
    }
  • The methods implemented on the Buffer might not be what people actually need.

  • To guarantee a slice the functions populating the structs need to guarantee that I: Input<Buffer=&'a[T]> or something similar.

Moving the primitive parser-methods to the Input trait (gated behind Primitives) seems to be a good idea, this does not impose any real restrictions to Input::Buffer since the buffer type is the end result:

  • iter(&self) -> T: Iterator<Item=Self::Token>

    Iterator over the currently loaded tokens, does not consume anything. Used to eg. provide a position for take_while so it can then consume the right number of tokens.

    This might require &mut self for receiver in the case it is a lazy-loading input. Investigate later.

  • first(&self) -> Option<Self::Token>

    Specialized version of iter().next()

  • consume(&mut self, usize) -> Self::Buffer

    Takes n tokens and puts them in a buffer, then moves the internal position n steps forward.

    This is the main provider for zero-copy parsing (provided the implementation of Input supports it). It will provide a slice-reference or similar in the resulting buffer to the consumed tokens.

  • discard(&mut self, usize)

    Moves the internal position n steps forward.

  • min_remaining(&self) -> usize

    Returns the minimum number of tokens remaining. Eg. partial slice returns the remainder of the current slice.

  • is_end(&self) -> bool

    Returns true if the current input instance is the last one.

This currently leaves Input::Buffer unconstrained and will require the functions operating on the resulting buffers themselves (and not just the iterator) to restrict the type to satisfy the operations.

The associated type solution seems to be the best one so far.

Changes compared to 0.2 (so far during test-implementation):

  • fn foo(i: Input<T>, ...) -> ParseResult<T, ...> ->

    fn foo<I: Input<Token=T>>(i: I, ...) -> ParseResult<I, ...>

  • U8Result<T> cannot be used since there is no way to describe an I: Input<Token=u8> in a concrete way

  • SimpleResult cannot be used to help inference of Error type due to inference not resolving the associated type Token into a concrete type.

  • Any slice-returns have to either be replaced by I::Buffer or the input type has to be constrained like:

    fn my_slice_producer<'a, T, I: Input<Buffer=&'a [T]>>(i: I) -> ...
  • Internal State primitive's Error and Incomplete states now carries an Input instead of a slice/none.

    TODO: Check performance implications of this

Performance looks pretty promising initially (commit d42bf98):

     Running target/release/combinators-c90c10c5fe42c185

running 9 tests
test count_vec_10k                  ... bench:       7,389 ns/iter (+/- 457)
test count_vec_10k_maybe_incomplete ... bench:      10,224 ns/iter (+/- 943)
test count_vec_1k                   ... bench:         778 ns/iter (+/- 51)
test many1_vec_10k                  ... bench:       6,742 ns/iter (+/- 1,198)
test many1_vec_10k_maybe_incomplete ... bench:       6,726 ns/iter (+/- 352)
test many1_vec_1k                   ... bench:         997 ns/iter (+/- 74)
test many_vec_10k                   ... bench:       6,827 ns/iter (+/- 1,313)
test many_vec_10k_maybe_incomplete  ... bench:       6,721 ns/iter (+/- 1,367)
test many_vec_1k                    ... bench:         990 ns/iter (+/- 87)

test result: ok. 0 passed; 0 failed; 0 ignored; 9 measured

     Running target/release/http_bench-e295120700f03cf5

running 4 tests
test multiple_requests      ... bench:      46,178 ns/iter (+/- 3,740)
test single_request         ... bench:         634 ns/iter (+/- 26)
test single_request_large   ... bench:       1,000 ns/iter (+/- 56)
test single_request_minimal ... bench:         117 ns/iter (+/- 56)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

Where the maybe incomplete have to consider that it might not be the last slice (corresponds more closely with the current master).

Compared to master (456c5d4):

     Running target/release/combinators-c90c10c5fe42c185

running 6 tests
test count_vec_10k ... bench:       7,795 ns/iter (+/- 456)
test count_vec_1k  ... bench:         830 ns/iter (+/- 177)
test many1_vec_10k ... bench:       6,739 ns/iter (+/- 2,865)
test many1_vec_1k  ... bench:         981 ns/iter (+/- 258)
test many_vec_10k  ... bench:       6,888 ns/iter (+/- 566)
test many_vec_1k   ... bench:         993 ns/iter (+/- 104)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

     Running target/release/http_bench-e295120700f03cf5

running 4 tests
test multiple_requests      ... bench:      48,026 ns/iter (+/- 21,687)
test single_request         ... bench:         670 ns/iter (+/- 75)
test single_request_large   ... bench:       1,016 ns/iter (+/- 294)
test single_request_minimal ... bench:         127 ns/iter (+/- 25)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

This is awesome 👍 I'm currently using chomp for a (toy) language. Would love to capture line numbers for debug purposes and error messages.

@m4rw3r I think chomp is an awesome parser – both performant and ergonomic at the same time which is rare. I see that you haven't pushed any code to this repo for a while. What's the plan with this issue and some of the other currently unreleased features? I know how much work open source is so I'm not trying to put pressure on you, just curious for my own purposes.

I would love to help but honestly I don't understand most of what's going on in the source or this discussion for that matter :)

@ukutaht Hi and thanks for the praise!

I have been busy with some work-related things which have taken priority, but as soon as those are resolved I will continue to work on this. Hopefully that will be in a week or so.

The main reason I have started to write comments and issues (to myself mainly) is that it is easier to have it all in one spot, and having an "idea file" which is not checked into the repository was not so good for planning things :) But having feedback is awesome so anything you can do to help is welcome

The current plan is for me to give this branch a try in real life: #45 and write some parsers with it because there might be some inference issues due to the change from a specific type to a trait (I plan to update my pet project https://github.com/m4rw3r/kopparoxid to that version and rewrite parts of that parser, in addition to that I will be poking a bit more on a TOML parser)

I have already had to change the parse! macro to no longer be completely agnostic over the actual type used in the notation (well, less agnostic compared to before where it is relying on a macro for the <|> functionality). I am not too happy about that, and the tests are necessary to make sure that the inference works where it needs to and that it will not require particularly elaborate type definitions for local variables or functions. The tests also need to be updated to the new underlying architecture using a trait as well (will hopefully help me spot a few more issues but also simplify them a bit since there are fewer states), and the buffer module needs some tweaks.

The current annoyance is that Input::ret(i, foo) needs to be used in some cases for closures since the parameter-type itself is unbound, but it seems like this will not be possible to avoid since Rust does not yet support function-/module-wide type-inference.

After those things are done I can probably release a 0.3.0 version of Chomp and then continue to add more Input implementations which will include line-numbering and other nice things for versions 0.3.1 and forward.

Sounds great!

I could try pointing my chomp dependency to feature/input_trait and see if I can convert my ~300 line parser to use it. It's not an actual real life project but it should provide a data point nonetheless.

@ukutaht Awesome, hopefully there shouldn't be too many changes required, mainly only the function signatures. Also, if there are any particular parsers/combinators you find missing from the core (including any from the list here: #19), or any construction you find yourself doing frequently, just open an issue and I will consider adding it.

So I gave this a try but got discouraged pretty quickly. You've listed this as one of the cons:

Structs need to be generic over the buffer type to provide zero-copy parsing:

This is proving to be very difficult for me. My parser spits out an abstract syntax tree which is used everywhere downstream in the compiler. Making the AST generic over the buffer type is a pretty big change there.

The other thing is, even if I make the AST generic I'm not quite clear on how to use the Buffer trait. For example, if the signature of an identifier in my language's AST becomes:

pub struct Identifier<B: chomp::Buffer> {
    pub name: B,
}

At some point I have to access the name field and get a sensible result. Looking at (https://github.com/m4rw3r/chomp/blob/feature/input_trait/src/buffer/buffer.rs#L15) I'm not quite sure how to achieve that (or if it's possible with the current implementation at all).

Am I going about this the wrong way?

@ukutaht you could put a restriction on the I::Buffer in a where clause to ensure that you get what you need, though that will complicate the function signatures a bit. This is one of the things I am also thinking of a solution for. In some cases you do not really care what the actual representation is while parsing, but in cases like yours when you build the AST right away it is desirable to be able to directly refer to the data itself as a particular type.

The main benefit of it is of course that the type implementing Input could be more efficient in how it allocates memory and passes it around.

@ukutaht I was thinking of more complex input data-types than a plain slice/vector when I designed the types::Buffer and the types::Input and its Buffer associated type. For example this would allow some kind of rope-like datastructure to be used as a source for the parser or some other kind of mutable source (eg. arena like allocation of data which is filled on demand during parsing) where parsed data is thrown away unless it is part of a parse result.

So I am hesitant to impose the restriction of eg. std::ops::Deref<[T]> on the types::Buffer trait since it will require the underlying data to be stored in a contiguous piece of memory and will also make it impossible/unsafe to reallocate/discard data while parsing. I hope that adding external restrictions on the Input::Buffer associated type when calling the parser(s) will be sufficient since that will enable the users of the library to restrict the types of Buffer implementations to have the required properties.

Something like this (from the example of the HTTP-parser) would enable the use of slice-methods on the result while still allowing any kind of type::Buffer implementation which also supports Deref to a slice (eg. both a slice and a Vec<u8>):

fn request<I: U8Input>(i: I) -> SimpleResult<I, (Request<I::Buffer>, Vec<Header<I::Buffer>>)>
  where I::Buffer: Deref<Target=[u8]> {
    // ...
}

But working with slices is still the primary use-case, so I will study this further and see how I can improve the ergonomics when you want buffers which can be derefed to slices. The types::Buffer implementation is not final, same goes for the Input trait itself

The PR got merged