zeek/spicy

Filter generates code with incompatible iterators

Closed this issue · 6 comments

Using the filter or not in the following example should not make a difference, however the version with connected filter throws an error from generated code.

# @TEST-EXEC: echo "foo" | spicy-dump -d %INPUT 1>&2

module foo;

public type X = unit {
    headers: Y[];
};

type Y = unit {
    on %init {
        self.connect_filter(new F);
    }

    : b"foo";
};

type F = unit {
    %filter;
    : bytes &eod {
        self.forward($$);
    }
};
[fatal error] terminating with uncaught exception of type hilti::rt::InvalidIterator: incompatible iterator (../filter-fail.spicy:14:7)

This seems to have been broken since at least v1.2.0.

The exception here occurs since lahe is inconsistent with the filtered view Y is operating on. Even when fixing that I cannot make this work as expected since the parser consumes too much data and breaks lookahead parsing for headers which looks a lot like #1652.

I believe we have a conceptual challenge here in the way look-ahead operates. With a look-ahead rule, the parser has a set of choices in terms of input stream tokens possibly coming next. It extracts the next token out of the current stream, confirms it's one of the expected choices, and then continues according to that choice. (In this particular example, the set of choices actually has only one element, Foo, which means to continue with another Y. If that choice is not found, there's a default epsilon rule that will be taken instead, continuing at the top-level, i.e., terminating the array).

The problem is that this scheme doesn't work when a filter is involved, because now we have two separate input streams that, in general, don't have any relationship. That means the notion of "next token" becomes ill-defined because (a) it could be coming out of either stream and (b) it's not even clear what "next" means because the two streams, in general, aren't tied together in their stream positions.

As far as I can see, this is not solvable in general, and hence I think we need to reject usage of look-aheads across unit involving filtered input. I'm wondering if we could make some special cases work, like the one here where we only have one actual token and know which stream it must be coming from; but even then, not sure we'd know where to continue at the top-level stream once the token isn't found anymore (because we'd be making that decision somewhere in the middle of filtered data).

There are similar issues around &parse-from and likely &parse-at as well. They both set the input to a different stream which interferes with lookahead parsing. We should probably reject lookahead parsing for any of them unless we can make it work with a dedicated workaround.

There are similar issues around &parse-from and likely &parse-at as well.

I was wondering about that while I was writing my comment, and I'm not sure. The difference is that the data these parse, is completely separate from the current stream; it doesn't affect anything going on inside the main unit. I believe we can/should just ignore them for look-ahead processing from the perspective of the main unit. But I'm actually not sure if we do that right now, it's something that the grammar builder would need to account for.

We should look into rejecting such code.

#1722 now rejects the original code here. I also confirmed that &parse-from and &parse-at don't cause trouble.