Incorrect text output for magnifying glass emoji
The following program:
import jodd.lagarto.EmptyTagVisitor;
import jodd.lagarto.LagartoParser;
import jodd.lagarto.Tag;

public class Test {
    public static void main(String[] args) {
        EmptyTagVisitor visitor = new EmptyTagVisitor() {
            @Override
            public void tag(final Tag tag) {
                System.out.println("tag: " + tag.getName() + " " + tag.getType());
            }

            @Override
            public void text(final CharSequence text) {
                System.out.println("text: " + text.toString().replace("\n", "\\n"));
                // print the expected character (U+1F50E) on its own line for comparison
                System.out.println(Character.toChars(0x1F50E));
            }
        };
        // note the magnifying glass emoji: https://www.codetable.net/hex/1f50e
        LagartoParser parser = new LagartoParser("<html><body>Search 🔎</html>");
        parser.parse(visitor);
    }
}
Produces the following output:
tag: html START
tag: body START
text: Search 
🔎
tag: html END
It might be a matter of character references that are surrogate pairs.
Maybe a "codepoint & 0xFFFF" somewhere in the code?
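A quick way to check that guess, as a minimal sketch: U+1F50E lies outside the Basic Multilingual Plane, so in Java it needs two `char` values (a surrogate pair). Keeping only the low 16 bits yields U+F50E, a private-use character, which would explain the wrong glyph in the output above. The class and method names here are made up for illustration and are not part of Lagarto; the sketch also assumes the parser input reached it as a numeric character reference rather than as a literal emoji.

```java
final class SurrogateCheck {
    static final int MAGNIFYING_GLASS = 0x1F50E;

    // What a "codepoint & 0xFFFF" (or a cast to char) would keep: the low 16 bits only.
    static int lowSixteenBits(int codePoint) {
        return codePoint & 0xFFFF;
    }

    public static void main(String[] args) {
        // The truncated value: a single private-use character.
        System.out.printf("masked: U+%04X%n", lowSixteenBits(MAGNIFYING_GLASS));
        // The two UTF-16 code units the text should contain instead.
        System.out.printf("high:   U+%04X%n", (int) Character.highSurrogate(MAGNIFYING_GLASS));
        System.out.printf("low:    U+%04X%n", (int) Character.lowSurrogate(MAGNIFYING_GLASS));
        // true for any code point above U+FFFF.
        System.out.println(Character.isSupplementaryCodePoint(MAGNIFYING_GLASS));
    }
}
```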
Yeah, it seems that I get the correct number, but:
c = (char) value;
breaks it.
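For anyone hitting the same thing elsewhere, here is a rough sketch of the difference between the cast and a code-point-aware append. This is not the actual Lagarto patch; `truncating` and `codePointAware` are hypothetical helpers that only illustrate the idea using standard JDK calls.

```java
final class CodePointAppend {
    // Mirrors the problematic pattern: the cast silently drops everything above 16 bits.
    static String truncating(int value) {
        char c = (char) value;
        return String.valueOf(c);
    }

    // One possible fix: let the JDK emit one or two chars as needed.
    // appendCodePoint throws IllegalArgumentException for values that are not valid code points.
    static String codePointAware(int value) {
        return new StringBuilder().appendCodePoint(value).toString();
    }

    public static void main(String[] args) {
        int value = 0x1F50E; // magnifying glass emoji

        String bad = truncating(value);
        String good = codePointAware(value);

        System.out.println(Integer.toHexString(bad.charAt(0))); // the truncated code unit
        System.out.println(good.length());                      // number of UTF-16 code units
        System.out.println(good.codePointAt(0) == value);       // round-trips to the same code point
    }
}
```

Values that fit in a single `char` behave the same in both helpers, so only supplementary characters such as emoji are affected.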
Btw, I will release the fixed version soon, if you don't have any more issues.
Thank you VERY MUCH for finding and reporting these nasty bugs!
Wow that was fast!
Thank you so much for those lightning-fast fixes!
This is all I have so far in terms of bugs :-).
If you release a new version, I will definitely give it a try on common-crawl data.
I do have a remark about one use case, though: knowing the position of text in the input (something similar to tagPosition and tagLength). If that's OK with you, I might open a new ticket to share it.