firasdib/Regex101

quantifier after \Q \E block not allowed, but in pcre2test it is

Closed this issue · 4 comments

Bug Description

I just found that, according to pcre2test, a quoting environment \Q ... \E can be quantified. As far as I can tell, this is not documented in the man page, so it surprised me. Your tool (in PCRE2 mode) says it can't be quantified, which seems to agree with the man page. However, if pcre2test says otherwise, isn't that the true reference?

Reproduction steps

Entering ^\Qa\E+$ in regex101's input field in PCRE2 mode results in the quantifier marked as an error and the following text is shown:

+ The preceding token is not quantifiable

However, it works fine with pcre2test. A single character inside the quoting can match:

echo -n '/^\Qa\E+$/debug@aa@' | tr -s "@" "\n"|pcre2test
PCRE2 version 10.39 2021-10-29
/^\Qa\E+$/debug
------------------------------------------------------------------
  0   7 Bra
  3     ^
  4     a++
  6     $
  7   7 Ket
 10     End
------------------------------------------------------------------
Capture group count = 0
Compile options: <none>
Overall options: anchored
First code unit = 'a'
Subject length lower bound = 1
aa
 0: aa

But if I have more than one character in the quoted block, I can't make anything match, even though the tool still accepts the expression as syntactically correct. I can also quantify with {N,M}.

I also tried aa in the subject, and it matched as well, but aaa did not match.
Here is another weird result:

echo -n '/^\Qaaa\E+$/debug@aaaa@' | tr -s "@" "\n"|pcre2test
PCRE2 version 10.39 2021-10-29
/^\Qaaa\E+$/debug
------------------------------------------------------------------
  0  11 Bra
  3     ^
  4     aa
  8     a++
 10     $
 11  11 Ket
 14     End
------------------------------------------------------------------
Capture group count = 0
Compile options: <none>
Overall options: anchored
First code unit = 'a'
Subject length lower bound = 3
aaaa
 0: aaaa

Shorter aa sequences didn't match.

Expected Outcome

I expect regex101 to give the same result as pcre2test, when in PCRE2 mode, but maybe this should be seen as a flaw in pcre2test: I don't know.

Best regards,
David

Browser

google-chrome Version 124.0.6367.201 (Official Build) (64-bit)

OS

Ubuntu 22.04

I experienced the same with Python.

The pattern a*+a is flagged with "+ The preceding token is not quantifiable", even though it comes from the Python documentation.

I tested in Python and it works as expected, no complaints there.
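For reference, here is a small sketch of how Python's own re module treats this pattern. It assumes Python 3.11 or later, where possessive quantifiers such as *+ were added; the function name possessive_demo is made up for illustration and is not part of regex101 or the Python documentation.

```python
import re
import sys


def possessive_demo(subject="aaa"):
    """Compare greedy a*a with possessive a*+a on the same subject.

    Returns a (greedy_matched, possessive_matched) tuple, or None when the
    interpreter is older than 3.11 and cannot compile a*+a at all.
    """
    if sys.version_info < (3, 11):
        return None
    # Greedy: a* can give back one 'a' so that the trailing 'a' matches.
    greedy = re.fullmatch(r"a*a", subject)
    # Possessive: a*+ keeps every 'a' it consumed, so the trailing 'a'
    # has nothing left to match.
    possessive = re.fullmatch(r"a*+a", subject)
    return (greedy is not None, possessive is not None)


if __name__ == "__main__":
    print(possessive_demo())
```

The point is only that a*+a is accepted as valid syntax there: the + after * marks the quantifier as possessive and is not a second quantifier applied to a*.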

Thanks for reporting this, I have added support for it in the new version. Do note that the quantifier only applies to the last character of the quoted sequence, not to the entire construct.
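To illustrate the "last character only" behavior, here is a rough Python sketch. Python's re module has no \Q...\E, so the helper expand_quoting below is a hypothetical stand-in that rewrites each quoted block into an escaped literal before compiling. It is a simplification that ignores corner cases such as an escaped backslash directly before Q, and it is not how PCRE2 or regex101 implement the feature.

```python
import re

# Hypothetical helper pattern: finds \Q ... \E blocks. An unterminated \Q
# is assumed to run to the end of the pattern.
_QUOTE_BLOCK = re.compile(r"\\Q(.*?)(?:\\E|$)", re.DOTALL)


def expand_quoting(pattern: str) -> str:
    """Replace every \\Q...\\E block with its escaped literal text."""
    return _QUOTE_BLOCK.sub(lambda m: re.escape(m.group(1)), pattern)


if __name__ == "__main__":
    expanded = expand_quoting(r"^\Qaaa\E+$")
    print(expanded)
    # A quantifier that follows the block now binds to the last literal
    # character only, as it would for any plain literal string.
    for subject in ("aa", "aaa", "aaaa"):
        print(subject, bool(re.fullmatch(expanded, subject)))
```

Under this emulation ^\Qaaa\E+$ becomes ^aaa+$, which fits the pcre2test listing above (a literal aa followed by a quantified a, with a subject length lower bound of 3).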

Thanks! Yes, it quantifies the last character. I have checked many corner cases concerning \Q \E blocks, including empty ones (such as \Q\E and an isolated \E). Inserting an empty quoting inside a { , } quantifier forces it to be interpreted as a literal string instead of a quantifier. Inside character classes, however, empty quotings are removed completely and don't affect the parse.
If I find features of PCRE2 that your tool doesn't support, should I report them here? I guess you are already aware of them, but, for instance, Unicode properties like \p{armi} and friends are not supported, as far as I know.

Feel free to report all discrepancies you can find.