Unicode word boundaries with double bytes words

Question

loretoparisi opened this issue 5 years ago · 2 comments

Hello, I'm using this regex to match a token on word boundaries, meaning it is not preceded by a letter plus optional marks and not followed by a letter, as described here:

var pattern = '(?<!\\pL\\pM*)$1(?!\\pL)'
var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
var wordRegex = new RegExp(pattern.replace('$1', escaped), "g");

But XRegExp does not match tokens made of double-byte characters, as in the example below:

    var pattern = '(?<!\\pL\\pM*)$1(?!\\pL)';
    var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
    var wordRegex = new XRegExp(pattern.replace('$1', escaped), "g");

    // calculate token begin end 
    var match = null;
    while ((match = wordRegex.exec(text)) !== null) {
      if (match.index > (seen.get(token) || -1)) {
        var wordStart = match.index;
        var wordEnd = wordStart + token.length - 1;
        item.characterOffsetBegin = wordStart;
        item.characterOffsetEnd = wordEnd;
        seen.set(token, wordEnd);
        break;
      }
    }
    return item;
  });
  
  console.log(indexes);

  indexes.forEach(index => {
    if (index.word != text.slice(index.characterOffsetBegin, index.characterOffsetEnd + 1)) {
      console.log("NOT MATCHING!!! " + index.word + " : " + text.slice(index.characterOffsetBegin, index.characterOffsetEnd + 1))
    } else {
      console.log("\tMATCHED " + index.word + " : " + text.slice(index.characterOffsetBegin, index.characterOffsetEnd + 1))  
    }
  });

The result is:

NOT MATCHING!!! 해 : 
NOT MATCHING!!! 롭 : 
NOT MATCHING!!! 단 : 
        MATCHED 거 : 거
        MATCHED 잘 : 잘
NOT MATCHING!!! 알 : 
NOT MATCHING!!! 지 : 
        MATCHED love : love

With the ordinary RegExp, it works properly:

        MATCHED 해 : 해
        MATCHED 롭 : 롭
        MATCHED 단 : 단
        MATCHED 거 : 거
        MATCHED 잘 : 잘
        MATCHED 알 : 알
        MATCHED 지 : 지
        MATCHED love : love

I introduced XRegExp to add full Unicode support to RegExp, and in other cases it does in fact work perfectly.

I'm using node v12.16.1.

Answer 1 · 2020-10-29T15:36:27.000Z

Can you please provide a reduced test case that shows an example of a result different than you expect? There's a lot of stuff going on in your code that isn't super easy to follow.

E.g.:

var regex = XRegExp('...');
var str = '...';
var match = XRegExp.exec(str, regex);
// match.index is x, but should be y
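
For comparison, a reduced case of the same shape for the native engine might look like the sketch below. It assumes only the built-in RegExp (no XRegExp) and a hypothetical two-syllable sample string. Without the u flag, native RegExp does not treat \pL as a Unicode token: it is an identity escape that matches the literal text "pL", so the lookarounds from the question place no constraint on Hangul input.

```javascript
// Native RegExp, no `u` flag: `\pL` is read as the literal characters "pL".
var literal = new RegExp('\\pL');
var matchesLiteralText = literal.test('pL'); // true
var matchesHangul = literal.test('해');       // false

// The question's pattern with the token '롭' substituted for $1.
var regex = new RegExp('(?<!\\pL\\pM*)롭(?!\\pL)', 'g');
var str = '해롭'; // hypothetical sample: '롭' directly follows another Hangul letter
var match = regex.exec(str);
// match.index is 1: the lookbehind only rejects a preceding literal "pLp"
// (plus optional "M" characters), so the adjacent letter '해' does not block it.
console.log(matchesLiteralText, matchesHangul, match && match.index);
```

This would be consistent with the native run reporting every token as matched, since the boundary checks never see a Unicode letter class.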

Answer 2 · 2021-02-06T23:48:00.000Z

Feel free to reopen this with a simplified/reduced test case that shows the problem more concisely.

Note that XRegExp requires the A (astral) flag for Unicode tokens such as \p{...} to match astral code points, i.e. those above U+FFFF, which JavaScript strings store as two 16-bit code units. You can turn this on automatically for all new XRegExp regexes by running XRegExp.install('astral').
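
As a point of comparison that does not depend on XRegExp, recent Node versions support Unicode property escapes natively when the u flag is set. The sketch below is an illustration under assumptions, not the library's behavior: the helper names escapeRegExp and findTokenOffsets and the sample text are hypothetical. It shows that a token surrounded by spaces is found, while a token directly adjacent to another letter is rejected by the letter lookarounds.

```javascript
// Escape only the regex syntax characters; under the `u` flag, escaping
// other characters (for example `\-` or `\,`) is a syntax error.
function escapeRegExp(token) {
  return token.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Find inclusive begin/end offsets of the first whole-word occurrence of
// `token`: not preceded by a letter plus optional marks, not followed by a letter.
function findTokenOffsets(text, token) {
  var pattern = '(?<!\\p{L}\\p{M}*)' + escapeRegExp(token) + '(?!\\p{L})';
  var match = new RegExp(pattern, 'u').exec(text);
  if (match === null) return null;
  return { begin: match.index, end: match.index + token.length - 1 };
}

var sample = '해롭단 거 love'; // hypothetical sample text
console.log(findTokenOffsets(sample, '거'));   // { begin: 4, end: 4 }
console.log(findTokenOffsets(sample, 'love')); // { begin: 6, end: 9 }
console.log(findTokenOffsets(sample, '롭'));   // null, it sits between two Hangul letters
```

Here '롭' returns null because its neighbours are letters, so any pattern in which the letter class really matches Hangul will skip syllables inside a longer word.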