onetrueawk/awk

Ancient awk regexp compatibility bug

Closed this issue · 6 comments

Using the code from master as of today, I found the following bug. Given:

BEGIN {
	print match("abc-def", /[qrs---tuv]/)
}

The One True Awk prints a result of 0, whereas gawk and mawk print 4. Ancient awks (and I think it's even documented in the awk book) allowed a "range" of minus through minus to mean a real actual minus sign. The current code doesn't support this anymore.

plan9 commented

interesting find. i now think this must be a historic implementation wart. thanks

Harumph. It looks like plain matching of --- works:

$ echo xxx-y | ./a.out '/[a---q]/'
xxx-y

So maybe it's just an issue with the match() function?

Hi arnold, plan9:

(I completely changed the content of this post before anyone responded but quite a few hours after initially posting it. Hopefully I didn't cause any confusion or inconvenience.)

After looking at the code, I think I understand what's happening. cclenter in b.c does not understand the triple-minus idiom. When it detects an invalid range, where the end point precedes its starting point, it backs up and drops the range.

In the initial report, [qrs---tuv] becomes [qr-tuv] (invalid range s-- dropped).
In the other example, [a---q] becomes [-q] (invalid range a-- dropped).
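
To make the two rewrites above concrete, here is a minimal sketch of the behavior as I understand it. It is not the actual cclenter code from b.c: the function name expand_class is hypothetical, escape handling is left out, and the only thing it models is "expand ranges left to right, and when a range's end precedes its start, drop the range together with its start character".

```c
#include <stddef.h>

/*
 * Hypothetical helper (not from b.c): expand the body of a bracket
 * expression into the plain list of characters it would match.
 * Assumption: backslash escapes are ignored to keep the sketch short.
 */
void expand_class(const char *src, char *out, size_t outsz)
{
	const unsigned char *p = (const unsigned char *) src;
	size_t n = 0;	/* number of characters emitted so far */
	int c;

	if (outsz == 0)
		return;
	while ((c = *p++) != 0 && n + 1 < outsz) {
		/* A '-' with a character before it and one after it is a range. */
		if (c == '-' && n > 0 && *p != 0) {
			int lo = (unsigned char) out[n - 1];
			int hi = *p++;

			if (lo > hi) {
				/* Invalid range: back up over the start character too. */
				n--;
				continue;
			}
			/* Valid range: emit lo+1 .. hi (lo itself is already there). */
			while (lo < hi && n + 1 < outsz)
				out[n++] = (char) ++lo;
			continue;
		}
		out[n++] = (char) c;	/* ordinary character, or a leading '-' */
	}
	out[n] = '\0';
}
```

With this sketch, "qrs---tuv" expands to "qrstuv": the s-- range is dropped along with the s, and the remaining minus then forms the range r-t, so no literal minus survives. "a---q" expands to "-q": once the a is dropped, the remaining minus is the first character in the class and is taken literally, which is why the xxx-y example still matches.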

Take care,
Miguel

plan9 commented

hi miguel, apologies for the late response. this is correct; i tested this and found the same thing weeks ago. i don't see a clean fix at the moment.

plan9 commented

triple minus should not be called an idiom. i think on its own, it's a twisted but legit construct that works with most [all?] regular expression engines [including one I built decades ago], but in combination with other characters in the bracket expression, it may or may not work, depending on the engine. not many implementors will go through the kind of contortions that e.g. the mawk regex engine goes through to handle this, nor should they.

Let's close this issue since it's not going to change.