Missing handling for newlines at EOS for $ with pcre
katef opened this issue · 1 comments
I'm looking at this test from the pcre suite, in testinput6:
/xyz$/
xyz
xyz\n
\= Expect no match
xyz\=noteol
xyz\n\=noteol
This is without /m. We don't match xyz\n there. Why is it expected to match? Well it seems pcre has special handling for $ at the end of a string:
The dollar character is an assertion that is true only if the current matching point is at the end of the subject string, or immediately before a newline at the end of the string (by default), unless PCRE2_NOTEOL is set.
http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC6
and:
Note, however, that it does not actually match the newline.
(which I presume means they exclude the newline from capture).
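For reference, the same rule can be observed outside pcre. This is a minimal sketch using Python's re module, whose $ is documented to match at the end of the string or just before a newline at the end of the string. It is a stand-in for illustration, not pcre itself, and the noteol cases from the test are left out. The helper name match_span is made up here.

```python
import re

# Python's `$` (without re.MULTILINE) follows the same documented rule:
# end of subject, or just before a newline that is the last character.
PATTERN = re.compile(r"xyz$")

def match_span(subject):
    """Return (start, end) of the match, or None. Hypothetical helper for this sketch."""
    m = PATTERN.search(subject)
    return m.span() if m else None

for subject in ["xyz", "xyz\n", "xyz\n\n"]:
    print(repr(subject), match_span(subject))
```

In the "xyz\n" case the span ends before the newline, which is consistent with the "does not actually match the newline" remark: the assertion succeeds there but consumes nothing.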
Do we want to simulate this behaviour? I'm not sure. Maybe. But I'm not sure how we'd identify and disallow it, either.
It seems that $ in pcre essentially means [\n]?$. One option might be to handle this during parse, and construct that in the AST. I think that wouldn't even be conditional; we'd always construct [\n]?$ (where $ means the AST_ANCHOR node) for pcre. Unless we're in multiline mode, which isn't supported anyway.
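A rough way to sanity-check that rewrite, again using Python's re as a stand-in rather than pcre: here \Z plays the role of the strict "very end of subject" anchor (Python's \Z, which behaves like pcre's \z, not pcre's \Z). The names DOLLAR, REWRITTEN and spans are made up for this sketch.

```python
import re

DOLLAR = re.compile(r"xyz$")            # newline-tolerant `$`
REWRITTEN = re.compile(r"xyz[\n]?\Z")   # optional newline, then strict end of subject

def spans(subject):
    """Return the match spans of both patterns (None where there is no match)."""
    a = DOLLAR.search(subject)
    b = REWRITTEN.search(subject)
    return (a.span() if a else None, b.span() if b else None)

for subject in ["xyz", "xyz\n", "xyz\n\n"]:
    print(repr(subject), spans(subject))
```

Both forms accept and reject the same three subjects, but for "xyz\n" the rewritten form's span is one character longer because the [\n]? consumes the newline instead of merely asserting it.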
If we do that, we'll need to find a way to exclude this from capturing groups. Either by rejecting the syntax entirely within a capturing group, or perhaps by rewriting the AST such that (xyz$) → (?:(xyz)$)
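To make the second option concrete, here is a small sketch of the hoisting rewrite. The node classes (Literal, EndAnchor, Concat, Group) and the functions hoist_trailing_anchor and render are hypothetical, invented for illustration; they are not the project's actual AST. The sketch only covers the simplest shape, where the anchor is the last item directly inside a capturing group; alternations and nested groups would need more care.

```python
from dataclasses import dataclass

# Hypothetical AST node types, for illustration only.
@dataclass
class Literal:
    text: str

@dataclass
class EndAnchor:
    """Stands in for the AST_ANCHOR node produced by `$`."""

@dataclass
class Concat:
    items: list

@dataclass
class Group:
    child: object
    capturing: bool

def hoist_trailing_anchor(node):
    """Rewrite (X$) into (?:(X)$) so the anchor sits outside the capture."""
    if isinstance(node, Group) and node.capturing and isinstance(node.child, Concat):
        items = node.child.items
        if items and isinstance(items[-1], EndAnchor):
            inner = Group(Concat(items[:-1]), capturing=True)
            return Group(Concat([inner, items[-1]]), capturing=False)
    return node  # anything else is left untouched

def render(node):
    """Print the tree back as regex syntax, to inspect the rewrite."""
    if isinstance(node, Literal):
        return node.text
    if isinstance(node, EndAnchor):
        return "$"
    if isinstance(node, Concat):
        return "".join(render(item) for item in node.items)
    if isinstance(node, Group):
        return ("(" if node.capturing else "(?:") + render(node.child) + ")"
    raise TypeError(f"unknown node: {node!r}")

before = Group(Concat([Literal("xyz"), EndAnchor()]), capturing=True)
print(render(before), "->", render(hoist_trailing_anchor(before)))
```

With the anchor hoisted out, whatever the anchor expands to (including an optional newline) ends up outside the capturing group, so the group's captured text stays xyz.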