oblac/jodd-lagarto

Missing & in text output for incomplete/incorrect character references

bfreuden opened this issue · 0 comments

The following program:

import jodd.lagarto.EmptyTagVisitor;
import jodd.lagarto.LagartoParser;
import jodd.lagarto.Tag;

public class Test {

    public static void main(String[] args) {
        EmptyTagVisitor visitor = new EmptyTagVisitor() {

            @Override
            public void tag(final Tag tag) {
                System.out.println("tag: " + tag.getName() + " " + tag.getType());
            }

            @Override
            public void text(final CharSequence text) {
                System.out.println("text: " + text.toString().replace("\n", "\\n"));
            }

        };
        // note the incomplete/incorrect character references
        LagartoParser parser = new LagartoParser("<html><body>&#... &#x...</body></html>");
        parser.parse(visitor);

    }

}

It produces this output (note the missing & characters in the text line):

tag: html START
tag: body START
text: #... #x...
tag: body END
tag: html END
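
For comparison, the HTML tokenizer rules treat a numeric character reference with no digits as a parse error and leave the consumed characters, including the &, in the text. The expected text line would therefore be "text: &#... &#x...". Below is a minimal, self-contained sketch of that behavior. It is not Lagarto's code: the class NumericRefSketch and its decode method are hypothetical names made up for illustration, and the sketch ignores named references and the spec's remapping of the 0x80-0x9F range.

```java
final class NumericRefSketch {

    // Hypothetical helper, not part of Lagarto. Decodes numeric character
    // references (&#65; or &#x41;) and leaves malformed ones, i.e. those with
    // no digits after "&#" or "&#x", untouched, including the leading '&'.
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            boolean startsNumericRef =
                    c == '&' && i + 1 < input.length() && input.charAt(i + 1) == '#';
            if (!startsNumericRef) {
                out.append(c);
                i++;
                continue;
            }
            int pos = i + 2;
            int radix = 10;
            if (pos < input.length()
                    && (input.charAt(pos) == 'x' || input.charAt(pos) == 'X')) {
                radix = 16;
                pos++;
            }
            int start = pos;
            while (pos < input.length() && isAsciiDigit(input.charAt(pos), radix)) {
                pos++;
            }
            if (pos == start) {
                // No digits: not a character reference. Emit the '&' as
                // literal text; the following characters are copied as-is
                // by the next loop iterations.
                out.append('&');
                i++;
                continue;
            }
            int code = 0;
            for (int k = start; k < pos; k++) {
                code = code * radix + Character.digit(input.charAt(k), radix);
                if (code > 0x10FFFF) {
                    code = 0x110000; // clamp so long digit runs cannot overflow
                }
            }
            boolean invalid = code == 0 || code > 0x10FFFF
                    || (code >= 0xD800 && code <= 0xDFFF);
            out.appendCodePoint(invalid ? 0xFFFD : code);
            if (pos < input.length() && input.charAt(pos) == ';') {
                pos++; // the terminating semicolon is optional
            }
            i = pos;
        }
        return out.toString();
    }

    private static boolean isAsciiDigit(char c, int radix) {
        return c < 128 && Character.digit(c, radix) >= 0;
    }

    public static void main(String[] args) {
        // Same text content as in the reproducer above.
        System.out.println("text: " + decode("&#... &#x..."));
        // prints: text: &#... &#x...
    }
}
```

With this approach, well-formed input such as "&#65;&#x42;" still decodes to "AB", while the malformed references from the reproducer pass through unchanged.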