MXParser tokenization fails when PI is before first tag

Question

MXParser tokenization fails when PI is before first tag

belingueres opened this issue 5 years ago · 3 comments

Consider these valid XML documents:

<?a?>
<test>nnn</test>

and

<?xml version="1.0" encoding="UTF-8"?>
<?a?>
<test>nnn</test>

Those tests fail then parsing the PI, returning instead START_DOCUMENT.


   @Test
    public void testProcessingInstructionTokenizeBeforeFirstTag()
        throws Exception
    {
        String input = "<?a?><test>nnn</test>";

        MXParser parser = new MXParser();
        parser.setInput( new StringReader( input ) );

        assertEquals( XmlPullParser.PROCESSING_INSTRUCTION, parser.nextToken() );
        assertEquals( XmlPullParser.START_TAG, parser.nextToken() );
        assertEquals( XmlPullParser.TEXT, parser.nextToken() );
        assertEquals( XmlPullParser.END_TAG, parser.nextToken() );
    }

    @Test
    public void testProcessingInstructionTokenizeAfterXMLDeclAndBeforeFirstTag()
        throws Exception
    {
        String input = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><?a?><test>nnn</test>";

        MXParser parser = new MXParser();
        parser.setInput( new StringReader( input ) );

        assertEquals( XmlPullParser.PROCESSING_INSTRUCTION, parser.nextToken() );
        assertEquals( XmlPullParser.PROCESSING_INSTRUCTION, parser.nextToken() );
        assertEquals( XmlPullParser.START_TAG, parser.nextToken() );
        assertEquals( XmlPullParser.TEXT, parser.nextToken() );
        assertEquals( XmlPullParser.END_TAG, parser.nextToken() );
    }

Answer 1 · 2020-03-31T22:53:29.000Z

I think the problem is the parsePI() method: it returns false when parsing the xml declaration (<?xml ...?>), which sets the event to PROCESSING_INSTRUCTION, but returns true when encounters another PI, which set the event to START_DOCUMENT. The logic should be exactly the inverse.
However this breaks several tests (and surely several apps).
A middle point may be to always return false from parsePI() so that always return PROCESSING_INSTRUCTION? WDYT?

Answer 2 · 2022-05-17T20:14:31.000Z

Just tested, tests still fail.

Answer 3 · 2023-05-10T13:27:54.000Z

@belingueres @michael-o I also come up with the same fix. Also, according to the javadoc for START_DOCUMENT, this event should only be returned before the parser actually parses anything, which goes in the same way.
The PR does that and fixes both tests.