hunterhacker/jdom

XMLOutputter removes newlines between attributes

Closed this issue · 3 comments

When reading an XML document that has newlines between attributes, XMLOutputter does not preserve them, even when the RAW format is used.

XML source:

<Field name = "foo"
      label = "I am foo"
      width = "100px"/>

The resulting XML after parsing and outputting:

<Field name = "foo" label = "I am foo" width = "100px"/>

Sample Groovy code:

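// Assumes JDOM 2 (org.jdom2); JDOM 1.x uses the org.jdom package instead.
import org.jdom2.Document
import org.jdom2.input.SAXBuilder
import org.jdom2.output.Format
import org.jdom2.output.XMLOutputter

File inFile = new File("test.xml")
File outFile = new File("test-JDOM.xml")

// Parse the source file into a JDOM document.
SAXBuilder builder = new SAXBuilder()
Document document = builder.build(inFile)

// Raw format with text content written back unchanged.
Format format = Format.getRawFormat()
format.setTextMode(Format.TextMode.PRESERVE)
XMLOutputter outputter = new XMLOutputter(format)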

outFile.withWriter { fileWriter ->
    outputter.output(document, fileWriter)
}

rolfl commented

The XML Specification assigns no significance to whitespace between attributes in an element start tag: https://www.w3.org/TR/xml/#sec-starttags (in fact, even the order of attributes is declared to be insignificant).

This carries through, in part, to both the SAX and DOM parsing specifications: XML parsers (like the Xerces parser built into Java) completely ignore this whitespace and do not report it when parsing an XML document. Note that the SAX startElement method simply lists the attributes, not the space between them: http://docs.oracle.com/javase/8/docs/api/org/xml/sax/ContentHandler.html#startElement-java.lang.String-java.lang.String-java.lang.String-org.xml.sax.Attributes-

The attributes are recorded with a simple Attributes instance: http://docs.oracle.com/javase/8/docs/api/org/xml/sax/Attributes.html
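
The point about startElement can be checked with the JDK's built-in SAX parser alone, without JDOM. The following is a minimal sketch under that assumption; the class and method names (AttributeWhitespaceDemo, reportedAttributes) are invented for illustration and are not part of JDOM or SAX. It parses the two variants of the Field element from this issue and collects whatever the parser reports for the attributes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only JDK classes are used.
class AttributeWhitespaceDemo {

    // Parses the XML and returns every attribute the parser reports, as "name=value".
    static List<String> reportedAttributes(String xml) {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Attributes exposes names, types and values only; it has no
                // accessor for the whitespace that separated them in the source.
                for (int i = 0; i < atts.getLength(); i++) {
                    seen.add(atts.getQName(i) + "=" + atts.getValue(i));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String multiLine = "<Field name = \"foo\"\n      label = \"I am foo\"\n      width = \"100px\"/>";
        String singleLine = "<Field name = \"foo\" label = \"I am foo\" width = \"100px\"/>";
        // Both calls should report the same three attributes.
        System.out.println(reportedAttributes(multiLine));
        System.out.println(reportedAttributes(singleLine));
    }
}
```

Since the handler receives the same information for both inputs, any library built on top of SAX has nothing from which to reconstruct the original line breaks.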

Because the XML specification gives no significance to the space between attributes, and because no standard XML parser reports it, JDOM has never been written to read this space. As a result, it cannot output it either. Further, there is no way in JDOM to manipulate (add, remove, change) the space between attributes programmatically.

Note that the document is semantically identical whether there is a large amount of whitespace or just a single space between attributes. It is also semantically identical if the order of the attributes changes (JDOM does maintain the order of the input attributes, although it will re-order the XML Namespace declarations, if any).
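
One way to see this equivalence is to compare parsed trees with the standard DOM Level 3 method Node.isEqualNode, which ignores attribute order and never sees the whitespace between attributes. This is only a sketch using the JDK's DOM parser rather than JDOM; the class and method names (AttributeLayoutEquality, parseRoot, sameElement) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; only JDK classes are used.
class AttributeLayoutEquality {

    // Parses the XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // True when the two root elements are equal in the DOM sense:
    // same name, same set of attributes and values, same children.
    static boolean sameElement(String a, String b) {
        return parseRoot(a).isEqualNode(parseRoot(b));
    }

    public static void main(String[] args) {
        String multiLine = "<Field name = \"foo\"\n      label = \"I am foo\"\n      width = \"100px\"/>";
        String singleLine = "<Field name = \"foo\" label = \"I am foo\" width = \"100px\"/>";
        String reordered = "<Field width=\"100px\" name=\"foo\" label=\"I am foo\"/>";
        System.out.println(sameElement(multiLine, singleLine));
        System.out.println(sameElement(multiLine, reordered));
    }
}
```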

I know of no standards-conforming parser, or any other Java XML library (like JDOM), that will report these specific spaces for you.

The "PRESERVE" text mode in JDOM refers specifically to the whitespace in element content (between a start tag and its matching end tag). JDOM handles that (standard) process correctly.
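
By contrast with the space between attributes, whitespace in element content is delivered by the parser, which is why a library can preserve it. The sketch below illustrates this with the JDK's SAX parser on a document without a DTD; it is not JDOM code, and the names ElementContentWhitespace and reportedText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only JDK classes are used.
class ElementContentWhitespace {

    // Returns all character data the parser reports, concatenated in document order.
    static String reportedText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text run, so append each chunk.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Leading spaces, the newline and trailing spaces are all part of the element content.
        String xml = "<a>  line one\n  line two  </a>";
        System.out.println("[" + reportedText(xml) + "]");
    }
}
```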

Thanks for the explanation. This was something I wasn't sure about - what exactly was meant by PRESERVE.

rolfl commented

"Preserve" has a special meaning in XML ( https://www.w3.org/TR/xml/#sec-white-space ) where there is a special XML attribute <sometag ..... xml:space="preserve" ...> that can be set. A conforming parser/system should accurately maintain any whitespace inside an element with the "preserve" attribute set. JDOM does/honours this attribute. It can also be set (with the XMLOutputter's Preserve format) to do it for all elements, not just the ones that are specially marked.