Missing handling for newlines at EOS for $ with pcre

Question

katef opened this issue 4 years ago · 1 comments

I'm looking at this test from the pcre suite, in testinput6:

/xyz$/
    xyz   
    xyz\n
\= Expect no match
    xyz\=noteol
    xyz\n\=noteol

This is without /m, and we don't match xyz\n there. Why is it expected to match? It seems pcre has special handling for $ at the end of a string:

The dollar character is an assertion that is true only if the current matching point is at the end of the subject string, or immediately before a newline at the end of the string (by default), unless PCRE2_NOTEOL is set.

http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC6

and:

Note, however, that it does not actually match the newline.
(which I presume means they exclude the newline from capture).
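
To make the quoted rule concrete, here is a minimal sketch using Python's re module. This is an assumption for illustration only: Python's re is not PCRE and is not what this project uses, but its default (non-multiline) $ is documented with the same rule, so it shows the behaviour the test expects. The variable names are made up for this example.

```python
import re

# Python's `re` is used here only as a stand-in for PCRE: without MULTILINE,
# its `$` matches at the end of the string or just before a newline that is
# the last character of the string.
pattern = re.compile(r"xyz$")

m_plain = pattern.search("xyz")
m_newline = pattern.search("xyz\n")

# Both subjects match, and the trailing newline is not part of the match.
print(m_plain.span())            # (0, 3)
print(m_newline.span())          # (0, 3)
print(repr(m_newline.group(0)))  # 'xyz'

# In Python, \Z matches only at the very end of the string, so it
# rejects the subject with a trailing newline.
strict = re.compile(r"xyz\Z")
print(strict.search("xyz\n"))    # None
```

The last case is roughly the behaviour we have today for $: the trailing newline prevents the match.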

Answer 1 · 2021-04-08T22:05:16.000Z

Do we want to simulate this behaviour? I'm not sure. Maybe. But I'm not sure how we'd identify and disallow it, either.

It seems that $ in pcre essentially means [\n]?$. One option might be to handle this during parse, and construct that in the AST. I think that wouldn't even be conditional; we'd always construct [\n]?$ (where $ means the AST_ANCHOR node) for pcre. Unless we're in multiline mode, which isn't supported anyway.

If we do that, we'll need to find a way to exclude this from capturing groups, either by rejecting the syntax entirely within a capturing group, or perhaps by rewriting the AST such that (xyz$) → (?:(xyz)$).
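
A rough sketch of why the anchor would need to be hoisted out of the group, again using Python's re purely as a stand-in (an assumption, not this project's code). Here \n?\Z plays the role of the proposed [\n]?$ construction, with \Z standing in for a strict end-of-string AST_ANCHOR node. The names naive and hoisted are hypothetical.

```python
import re

subject = "xyz\n"

# Naive rewrite, anchor left inside the group: (xyz$) -> (xyz\n?\Z)
naive = re.search(r"(xyz\n?\Z)", subject)

# Hoisted rewrite: (xyz$) -> (?:(xyz)\n?\Z)
hoisted = re.search(r"(?:(xyz)\n?\Z)", subject)

# The naive form leaks the newline into capture group 1.
print(repr(naive.group(1)))    # 'xyz\n'

# The hoisted form keeps the newline out of capture group 1.
print(repr(hoisted.group(1)))  # 'xyz'

# The overall match still consumes the newline in this model.
print(repr(hoisted.group(0)))  # 'xyz\n'
```

Note that in this model the overall match (group 0) still includes the newline, because \n? consumes it rather than asserting it. Whether that matters presumably depends on whether the overall match span is ever reported.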