Incorrect text output for magnifying glass emoji

Question

Closed this issue 4 years ago · 4 comments

The following program:

import jodd.lagarto.EmptyTagVisitor;
import jodd.lagarto.LagartoParser;
import jodd.lagarto.Tag;

public class Test {

    public static void main(String[] args) {
        EmptyTagVisitor visitor = new EmptyTagVisitor() {

            @Override
            public void tag(final Tag tag) {
                System.out.println("tag: " + tag.getName() + " " + tag.getType());
            }

            @Override
            public void text(final CharSequence text) {
                System.out.println("text: " + text.toString().replace("\n", "\\n"));
                // for comparison: print the expected emoji (U+1F50E) on its own line
                System.out.println(Character.toChars(0x1F50E));
            }

        };
        // note the magnifying glass emoji: https://www.codetable.net/hex/1f50e
        LagartoParser parser = new LagartoParser("<html><body>Search &#x1F50E;</html>");
        parser.parse(visitor);
    }

}

Produces the following output:

tag: html START
tag: body START
text: Search 
🔎
tag: html END

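The text callback should receive "Search 🔎", but the emoji is not decoded correctly.
It might be a problem with character references above U+FFFF, which need a surrogate pair in Java.
Maybe there is a "codepoint & 0xFFFF" somewhere in the code?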

Answer 1 · 2020-11-22T23:14:54.000Z

Yeah, it seems that I get the correct number, but:

c = (char) value;

breaks it.
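
For anyone reading along, the effect of that cast can be reproduced outside Lagarto. This is only a minimal sketch, not the actual Lagarto code or its fix: the class and method names are made up for the illustration, and it assumes the parsed value is held in an int. Character.toChars is the standard JDK way to turn a code point into one or two chars.

```java
// Hypothetical demo class; none of these names come from Lagarto.
final class SupplementaryCodePointDemo {

    // Narrowing an int to char keeps only the low 16 bits,
    // so 0x1F50E becomes 0xF50E (a private-use character).
    static String decodeWithCharCast(int codePoint) {
        char c = (char) codePoint;
        return String.valueOf(c);
    }

    // Character.toChars returns one char for BMP code points
    // and a high/low surrogate pair for anything above U+FFFF.
    static String decodeWithToChars(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int magnifyingGlass = 0x1F50E;
        String broken = decodeWithCharCast(magnifyingGlass);
        String correct = decodeWithToChars(magnifyingGlass);
        // prints "cast:    U+F50E (1 char)"
        System.out.printf("cast:    U+%04X (%d char)%n", (int) broken.charAt(0), broken.length());
        // prints "toChars: U+1F50E (2 chars)"
        System.out.printf("toChars: U+%X (%d chars)%n", correct.codePointAt(0), correct.length());
    }
}
```

So the reporter's guess was close: the cast has the same effect as "codepoint & 0xFFFF".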

Answer 2 · 2020-11-22T23:19:14.000Z

btw, I will release the fixed version soon, if you don't have any more issues 👍

Thank you VERY MUCH for finding and reporting these nasty bugs!

Answer 3 · 2020-11-23T13:28:18.000Z

Wow that was fast!
Thank you so much for those lightning-fast fixes!
This is all I have so far in terms of bugs :-).
If you release a new version I will definitely give it a try on common-crawl data 👍

I do have a remark about one use case, though: knowing the position of text in the input (something similar to tagPosition and tagLength). If that's OK with you, I might open a new ticket to share it.

Answer 4 · 2020-11-24T22:03:26.000Z

@bfreuden Sure, I will release the fixes this week(end) :)

I am open to all ideas, please do share them!