Unicode word boundaries with double-byte words
loretoparisi opened this issue · 2 comments
Hello, I'm using this regex to match a word boundary between a non-letter and a non-mark, as described here:
var pattern = '(?<!\\pL\\pM*)$1(?!\\pL)'
var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
var wordRegex = new RegExp(pattern.replace('$1', escaped), "g");
But XRegExp will not match tokens made of double-byte characters, as in the example below:
var pattern = '(?<!\\pL\\pM*)$1(?!\\pL)'
var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
var wordRegex = new XRegExp(pattern.replace('$1', escaped), "g");
// calculate token begin end
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
    if (match.index > (seen.get(token) || -1)) {
        var wordStart = match.index;
        var wordEnd = wordStart + token.length - 1;
        item.characterOffsetBegin = wordStart;
        item.characterOffsetEnd = wordEnd;
        seen.set(token, wordEnd);
        break;
    }
}
return item;
});
console.log(indexes);
indexes.forEach(index => {
    if (index.word != text.slice(index.characterOffsetBegin, index.characterOffsetEnd + 1)) {
        console.log("NOT MATCHING!!! " + index.word + " : " + text.slice(index.characterOffsetBegin, index.characterOffsetEnd + 1))
    } else {
        console.log("\tMATCHED " + index.word + " : " + text.slice(index.characterOffsetBegin, index.characterOffsetEnd + 1))
    }
});
The result will be:
NOT MATCHING!!! 해 :
NOT MATCHING!!! 롭 :
NOT MATCHING!!! 단 :
MATCHED 거 : 거
MATCHED 잘 : 잘
NOT MATCHING!!! 알 :
NOT MATCHING!!! 지 :
MATCHED love : love
whereas if I use the ordinary RegExp, it works properly:
MATCHED 해 : 해
MATCHED 롭 : 롭
MATCHED 단 : 단
MATCHED 거 : 거
MATCHED 잘 : 잘
MATCHED 알 : 알
MATCHED 지 : 지
MATCHED love : love
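One possible explanation for the difference, offered as a sketch and not a confirmed diagnosis: without the `u` flag, native JavaScript treats `\p` as an identity escape, so `\pL` matches the literal text "pL" and the lookarounds never test for letters at all. The snippet below uses only native RegExp (no XRegExp); the sample text and token are hypothetical, chosen only for illustration, and the `u`-flag pattern with `\p{L}` and `\p{M}` stands in for Unicode-aware matching.

```javascript
// Hypothetical sample text and token, chosen only for illustration.
const text = "사랑해 love";
const token = "해";

// Without the `u` flag, `\p` is an identity escape: `\pL` matches the literal text "pL".
const literal = new RegExp("\\pL");
const matchesLiteralPL = literal.test("pL"); // true
const matchesHangul = literal.test("해");     // false

// Same pattern shape as in the question, built without the `u` flag.
// The lookarounds only reject the literal text "pLp", "pLpM", ... before the
// token and "pL" after it, so any occurrence of the token is accepted.
const loose = new RegExp("(?<!\\pL\\pM*)" + token + "(?!\\pL)", "g");
const looseMatch = loose.exec(text);

// With the `u` flag and `\p{L}` / `\p{M}`, the lookarounds test real Unicode
// categories, so a token directly preceded by another letter is rejected.
const strict = new RegExp("(?<!\\p{L}\\p{M}*)" + token + "(?!\\p{L})", "gu");
const strictMatch = strict.exec(text);

console.log(matchesLiteralPL, matchesHangul);   // true false
console.log(looseMatch && looseMatch.index);    // 2
console.log(strictMatch);                       // null
```

If the tokens reported as NOT MATCHING sit inside longer runs of letters, a Unicode-aware boundary check rejecting them would be the pattern doing what it says, while the plain RegExp version accepts them only because its lookarounds are effectively inert.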
I introduced XRegExp to add full Unicode support to RegExp, and in other cases it does in fact work perfectly.
I'm using node v12.16.1.
Can you please provide a reduced test case that shows an example of a result different than you expect? There's a lot of stuff going on in your code that isn't super easy to follow.
E.g.:
var regex = XRegExp('...');
var str = '...';
var match = XRegExp.exec(str, regex);
// match.index is x, but should be y
Feel free to reopen this with a simplified/reduced test case that shows the problem more concisely.
Note that XRegExp requires the A (astral) flag for Unicode tokens \p{...} to match double byte characters. You can turn this on automatically for all new XRegExp regexes by running XRegExp.install('astral').
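Astral mode concerns code points above U+FFFF, which JavaScript strings store as surrogate pairs. Hangul syllables occupy U+AC00 to U+D7A3, inside the Basic Multilingual Plane, so it may be worth checking whether a given token contains astral characters at all before relying on flag A. A minimal sketch of such a check, using only built-in string methods; `hasAstral` is a hypothetical helper name, not part of XRegExp:

```javascript
// Returns true if the string contains any code point above U+FFFF,
// i.e. a character stored as a surrogate pair in UTF-16.
function hasAstral(str) {
    for (const ch of str) { // string iteration yields whole code points
        if (ch.codePointAt(0) > 0xFFFF) {
            return true;
        }
    }
    return false;
}

// "\u{1F600}" is an emoji outside the Basic Multilingual Plane.
const samples = ["해", "love", "\u{1F600}"];
for (const s of samples) {
    console.log(JSON.stringify(s), "length:", s.length, "astral:", hasAstral(s));
}
```

A token for which `hasAstral` returns false is matched by `\p{...}` tokens in the default (BMP) mode, so a mismatch on such a token would have to come from somewhere other than the missing flag.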