zeek/spicy

Parse error during regexp parsing could synchronize earlier

Mohan-Dhawan opened this issue · 9 comments

I modified the synchronize-on-gap.test btest to reproduce a behavior I am seeing with a pcap that has gaps in content. The modifications are as follows:

The synchronization token is AB.

public type Xs = unit {
        xs: (/AB/ &synchronize)[];
        on %synced { confirm; }
};

The driver setting up the input is as follows:

int main() {
    hilti::rt::init();

    auto xs = hilti::rt::reference::make_value<__hlt::sync::Xs>();

    hilti::rt::ValueReference<hilti::rt::Stream> stream;
    stream->append("A");
    stream->append(nullptr, 1024); // Gap.
    stream->append("AB"); // should sync here, but is skipped
    stream->append("AB"); // instead gets synch'd here

    hlt::sync::Xs::parse2(xs, stream, {}, {});
    std::cout << xs << '\n';

    hilti::rt::done();
}

It generates the spicy-verbose log below, clearly indicating that synchronization happens on the second token while the first one is skipped:

  [spicy-verbose] - state: type=sync::Xs input="A<gap>..." stream=0x55e452d602e0 offsets=0/0/0/1029 chunks=4 frozen=no mode=default trim=yes lah=n/a lah_token="n/a" recovering=no
  [spicy-verbose] - parsing production: Unit: sync_Xs -> xs
  [spicy] sync::Xs
  [spicy-verbose]   - state: type=sync::Xs input="A<gap>..." stream=0x55e452d602e0 offsets=0/0/0/1029 chunks=4 frozen=no mode=default trim=yes lah=n/a lah_token="n/a" recovering=no
  [spicy-verbose]   - parsing production: While: xs  -> while(<look-ahead-found>): anon
  [spicy-verbose]     failed to parse list element, will try to synchronize at next possible element
  [spicy-verbose]     - state: type=sync::Xs input="A<gap>..." stream=0x55e452d602e0 offsets=0/0/0/1029 chunks=4 frozen=no mode=default trim=yes lah=n/a lah_token="n/a" recovering=yes
  [spicy-verbose]     - trimming input
  [spicy-verbose]     - state: type=sync::Xs input="ABAB" stream=0x55e452d602e0 offsets=1025/0/1025/1029 chunks=2 frozen=no mode=default trim=yes lah=n/a lah_token="n/a" recovering=yes
  [spicy-verbose]     - state: type=sync::Xs input="ABAB" stream=0x55e452d602e0 offsets=1025/0/1025/1029 chunks=2 frozen=no mode=default trim=yes lah=n/a lah_token="n/a" recovering=yes
  [spicy-verbose]     - trimming input
  [spicy-verbose]     - state: type=sync::Xs input="BAB" stream=0x55e452d602e0 offsets=1026/0/1026/1029 chunks=2 frozen=no mode=default trim=yes lah=n/a lah_token="n/a" recovering=yes
  [spicy-verbose]     - state: type=sync::Xs input="BAB" stream=0x55e452d602e0 offsets=1026/0/1026/1029 chunks=2 frozen=no mode=default trim=yes lah=n/a lah_token="n/a" recovering=yes
  [spicy-verbose]     - trimming input
  [spicy-verbose]     - state: type=sync::Xs input="AB" stream=0x55e452d602e0 offsets=1027/0/1027/1029 chunks=1 frozen=no mode=default trim=yes lah=1 lah_token="AB" recovering=yes
  [spicy-verbose]     successfully synchronized
  [spicy-verbose]     - state: type=sync::Xs input="AB" stream=0x55e452d602e0 offsets=1027/0/1027/1029 chunks=1 frozen=no mode=default trim=yes lah=1 lah_token="AB" recovering=no
  [spicy-verbose]     - parsing production: Ctor: anon -> /AB/ (regexp) (container 'xs')
  [spicy-verbose]       - consuming look-ahead token
  [spicy-verbose]       - trimming input
  [spicy-verbose]       - trimming input
  [spicy-verbose]     - got container item
  [spicy-verbose]     suspending to wait for more input for stream 0x55e452d602e0, currently have 0

I agree it would be nice to resynchronize earlier in this case, but since we are still able to synchronize I'd say that'd be an improvement.

This behavior seems to depend on us hitting a gap while matching a regexp; we would match on the earliest next pattern if instead, e.g.,

  • the input was AB, <gap>, AB, AB (this extracts three b"AB"; the behavior is similar if the stream had, e.g., a C in place of the <gap>), or
  • Xs was expecting b"AB" instead of /AB/.

To me this is a bug, because from a user's perspective, there is no reason why it shouldn't sync on the pattern after the gap.

Still sounds to me like a definite improvement, not a bug. In the docs we make no hard promise of a specific behavior but describe possible behaviors; we could look into softening that even more.

The docs say

Whenever an error occurs during parsing, Spicy will determine the closest synchronization point in the grammar following the error’s location, and then attempt to continue processing there by skipping ahead in the input data until it aligns with what that field is looking for.

So clearly the first token should be the closest synchronization point.

This doesn't seem to be that clear, since the docs are about which synchronization point is picked in the grammar. E.g., given

type Message = unit {
    a: b"A";
    b: b"B" &synchronize;
    c: b"C" &synchronize;
};

a parse error while parsing a will synchronize on b, not on c, since c would not be the closest synchronization point. This has nothing to do with the input stream.

I want to abstract away the details of the grammar here because they can be very confusing. What I wanted to highlight in my previous comment was this part of the statement: "skipping ahead in the input data until it aligns with what that field is looking for". To me, this indicates that the first instance of the token should be the natural choice for alignment.

I just talked offline with @J-Gras and he shared additional context.

The parser they are working on runs into the behavior documented here, which leads to it spending more time in recovery than needed (a pattern sits right after a gap, but the parser is instead forced to synchronize on a later pattern). This not only means they potentially parse less data than possible, but also forces them into relatively slow regexp matching during recovery, which is slower than their usual parsing; this severely reduces the throughput of their parser when faced with gaps.

relatively slow regexp matching during recovery which is slower than their usual parsing

This is an interesting statement. I don't think regexp matching should fundamentally be slower than other parsing; it might be worth digging in more to see if there's some obvious inefficiency somewhere.

Yes, the patch works. Thanks!