katef/libfsm

Missing handling for newlines at EOS for $ with pcre

katef opened this issue · 1 comments

katef commented

I'm looking at this test from the pcre suite, in testinput6:

/xyz$/
    xyz   
    xyz\n
\= Expect no match
    xyz\=noteol
    xyz\n\=noteol 

This is without /m. We don't match xyz\n there. Why is it expected to match? Well it seems pcre has special handling for $ at the end of a string:

The dollar character is an assertion that is true only if the current matching point is at the end of the subject string, or immediately before a newline at the end of the string (by default), unless PCRE2_NOTEOL is set.

http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC6

and:

Note, however, that it does not actually match the newline.
(which I presume means they exclude the newline from capture).
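
For illustration, the same default behaviour can be observed with Python's `re` module, whose `$` is documented to match at the end of the string or just before a newline at the end of the string. This is only a stand-in (it is neither PCRE2 nor libfsm, and it has no equivalent of `noteol`), but it shows the newline being tolerated without being part of the match:

```python
import re

pattern = re.compile(r"xyz$")  # no MULTILINE flag, i.e. "without /m"

m_plain = pattern.search("xyz")
m_newline = pattern.search("xyz\n")

# Both subjects match, and the match covers only "xyz" in each case:
# the trailing newline is tolerated by $ but not consumed.
plain_span = m_plain.span()
newline_span = m_newline.span()
newline_text = m_newline.group(0)

# With two trailing newlines, $ can only sit before the final one,
# which is not directly after "xyz", so there is no match.
m_double = pattern.search("xyz\n\n")

print(plain_span, newline_span, repr(newline_text), m_double)
```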

katef commented

Do we want to simulate this behaviour? I'm not sure. Maybe. But I'm not sure how we'd identify and disallow it, either.

It seems that $ in pcre essentially means [\n]?$. One option might be to handle this during parse, and construct that in the AST. I think that wouldn't even be conditional; we'd always construct [\n]?$ (where $ means the AST_ANCHOR node) for pcre. Unless we're in multiline mode, which isn't supported anyway.
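
A rough way to sanity-check that reading, again using Python's `re` as a stand-in rather than pcre or libfsm: compare the native `$` against an explicit optional newline followed by a strict end-of-input anchor (`\Z` in Python, playing the role of the AST_ANCHOR node here). The subject list is an arbitrary sample chosen for this sketch, not taken from the pcre suite.

```python
import re

native = re.compile(r"xyz$")         # $ with the built-in newline special case
desugared = re.compile(r"xyz\n?\Z")  # optional newline, then strict end of input

# Arbitrary sample subjects (an assumption of this sketch).
subjects = ["xyz", "xyz\n", "xyz\n\n", "xyzz", "xyz\nabc", "abcxyz\n"]

# For each subject, record whether each pattern accepts it.
agreement = {
    s: (native.search(s) is not None, desugared.search(s) is not None)
    for s in subjects
}

for s, (a, b) in agreement.items():
    print(repr(s), a, b)
```

On these subjects the two patterns accept and reject the same strings. What differs is the matched text: the desugared form consumes the newline, which is where the capture problem comes from.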

If we do that, we'll need to find a way to exclude this from capturing groups. Either by rejecting the syntax entirely within a capturing group, or perhaps by rewriting the AST such that (xyz$) becomes (?:(xyz)$).
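
As a small sketch of why the rewrite matters, here is the difference in capture 1 between expanding the anchor inside the group and hoisting it out. This uses Python's `re` purely for illustration, with `\n?\Z` written out by hand in place of the pcre `$`; it says nothing about how libfsm's AST would actually represent this.

```python
import re

subject = "xyz\n"

# Anchor expanded inside the capturing group: the newline leaks into capture 1.
inside = re.search(r"(xyz\n?\Z)", subject)

# Anchor hoisted out of the capture, in the shape of (?:(xyz)$):
# capture 1 no longer sees the newline.
hoisted = re.search(r"(?:(xyz)\n?\Z)", subject)

inside_capture = inside.group(1)
hoisted_capture = hoisted.group(1)

print(repr(inside_capture), repr(hoisted_capture))
```

Note that in this emulation the overall match (group 0) still includes the newline in both cases, so hoisting only fixes the explicit capturing group, not the extent of the whole match.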