jqlang/jq

Zero length regular expression match misbehaviour

weeble opened this issue · 4 comments

Description

Some regular expressions behave weirdly when matches are zero length - they miss matches or match too often.

Reproduction

Case 1

Input:

jq --null-input --compact-output '"xx"|match("";"g")'

Output:

{"offset":0,"length":0,"string":"","captures":[]}
{"offset":1,"length":0,"string":"","captures":[]}

Expected:

{"offset":0,"length":0,"string":"","captures":[]}
{"offset":1,"length":0,"string":"","captures":[]}
{"offset":2,"length":0,"string":"","captures":[]}

There should be a match at the end of the string too.

Case 2

Input:

$ jq --null-input --compact-output '"xx"|match("$";"g")'

Output:

{"offset":2,"length":0,"string":"","captures":[]}
{"offset":2,"length":0,"string":"","captures":[]}

Expected:

{"offset":2,"length":0,"string":"","captures":[]}

The same match shouldn't show up twice. (Extra matches seem to be returned in number equal to the length of the string, for some reason. But they're all the same match at the same offset.)

Environment

  • OS and Version: Ubuntu
  • jq version: master branch, commit cff5336

Additional context

I stumbled on this trying to fix #2148 and found that my attempted solution didn't behave like I expected it to. I tried to isolate the problem and ended up with these.

Okay, having looked at f_match, in particular this handling for zero-width matches and the loop termination condition, I see some issues. Here's another test case:

jq --null-input --compact-output '"xx"|match("|x";"g")'

This outputs:

{"offset":0,"length":0,"string":"","captures":[]}
{"offset":1,"length":0,"string":"","captures":[]}

But it should output:

{"offset":0,"length":0,"string":"","captures":[]}
{"offset":0,"length":1,"string":"x","captures":[]}
{"offset":1,"length":0,"string":"","captures":[]}
{"offset":1,"length":1,"string":"x","captures":[]}
{"offset":2,"length":0,"string":"","captures":[]}

Example using PCRE on regex101

Hmmm, oniguruma doesn't make this easy. There doesn't seem to be any way to call onig_search to get the next match that isn't a zero-width match at the start position. It seems you can either call it with a callback and get all the matches, which could be very inefficient, or you call it with a start offset and get only one match. This also makes it clear that my approach in #2564 is going to be way more costly than I thought, because I was assuming that match(...; "g") returns results lazily, and it doesn't look like it does.

I wonder if you would get the right result if you called it with the callback and filtered out every match that overlaps with the previously replaced match.

This seems very thorny. It might be worth reviewing what other users of oniguruma do to implement this. It doesn't even seem like there's universal agreement between regex engines, especially with awkward cases like "xx"|match("x?";"g") and "xx"|match("|x";"g")

Please note that Case 2 in the original post has been resolved by 83f375c :

jqMaster -nM '"xx"|[match("$";"g")] | length'
1

Note also that in this repaired version of jq, the following is nevertheless incorrect:

jqMaster -nM '"x"|sub("";"y"; "g")'
"yx"