mrabarnett/mrab-regex

Request: \K

mrabarnett opened this issue · 4 comments

Original report by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


Hi Matthew,
Thank you as always for the terrific engine.
In my view it's one of the very best engines out there.

There are three missing features that have been "talking to me" for a while, and I thought I'd put in some requests. I'm sure you've considered them before, but I'd like to put forward a case for each of them.

In this thread I'll focus on \K.

I realize that \K was originally intended as a workaround for the lack of infinite lookbehind.
Nevertheless, it is an extremely clean and expressive token.

Without \K, you have to use either a lookbehind or capturing groups.
Not a problem, but within long expressions, \K gives you a clean "drop everything matched so far".
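
For comparison, here is a minimal sketch of the two workarounds using the standard-library `re` module, which has no \K. The sample string and the `name=` prefix are made up for illustration.

```python
import re

text = "name=Alice"  # made-up sample input

# Workaround 1: a fixed-width lookbehind keeps the prefix out of the match.
m1 = re.search(r"(?<=name=)\w+", text)

# Workaround 2: match the prefix as well, then pull out the capture group.
m2 = re.search(r"name=(\w+)", text)

print(m1.group(0))  # Alice
print(m2.group(1))  # Alice
print(m2.group(0))  # name=Alice (the overall match still includes the prefix)
```

With \K, the PCRE spelling would be `name=\K\w+`: no lookbehind and no group index to keep track of.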

Also, I often have to translate many expressions from PCRE to Python. When the PCRE expressions are rich with \K, the absence of \K in regex is a real speed bump.

Thanks in advance for considering it again.

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


As far as I can tell, it would shorten group 0 (the entire match), but not any capture group:

#!python

>>> import regex
>>> m = regex.search(r'(abc\Kde)', 'abcde')
>>> m[0]
'de'
>>> m[1]
'abcde'

Therefore, it should also affect the span (start and end position) for group 0, but no other groups.

Is that correct?
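
The span behaviour described above can be checked with a small sketch like the following. It assumes the third-party `regex` module is installed in a release that supports \K; the variable names are illustrative only.

```python
import regex  # third-party module (pip install regex), assumed to support \K

m = regex.search(r"(abc\Kde)", "abcde")

# Group 0 (the entire match) starts at the \K position.
group0 = m.group(0)
span0 = m.span(0)

# Capture group 1 is unaffected and still covers the whole text.
group1 = m.group(1)
span1 = m.span(1)

print(group0, span0)  # de (3, 5)
print(group1, span1)  # abcde (0, 5)
```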

Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


Hi Matthew,

Yes, that's exactly right.

Also note that it's not a magic token: it can appear multiple times. For instance,
abc\Kde|fg\Khij matches de in abcde or hij in fghij

In PCRE, a\Kbc\Kde is legal. It serves no purpose, but I guess the idea is that the token can be dropped in anywhere.

You can also have it on just one side of an alternation, for instance ab(?:\Kde|fg), and so on.
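
A rough sketch of these cases, assuming a release of the third-party `regex` module that supports \K and follows the PCRE behaviour described above. The helper `kept` is a hypothetical name used only for this illustration.

```python
import regex  # third-party module (pip install regex), assumed to support \K


def kept(pattern, text):
    """Return the overall match (group 0) for pattern in text, or None."""
    m = regex.search(pattern, text)
    return m.group(0) if m else None


# \K in each branch of an alternation.
print(kept(r"abc\Kde|fg\Khij", "abcde"))  # de
print(kept(r"abc\Kde|fg\Khij", "fghij"))  # hij

# Several \K tokens in one pattern: the last one reached sets the match start.
print(kept(r"a\Kbc\Kde", "abcde"))  # de

# \K on one side of an alternation only, with that branch taken.
print(kept(r"ab(?:\Kde|fg)", "abde"))  # de
```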

I know you had EditPadPro at some stage because I recall seeing you on the forum. For testing purposes Jan has a good implementation in EPP and RegexBuddy, except for a minor bug that he plans to fix in the next release (one of the most recent threads on the RB forum).

Regards

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Added in regex 2015.09.14.

Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


Absolutely fantastic. Thank you so much for this time-saver.

An example for anyone interested in seeing it at work: everything to the left of \K (including the start=> marker) is dropped.

#!python


>>> import regex as mrab
>>> bsk = mrab.compile(r'start=>\K.*')
>>> print(bsk.search('boring stuff start=>interesting stuff'))
<regex.Match object; span=(20, 37), match='interesting stuff'>