/broken-xml

Parser that can parse any broken and invalid XML.

Primary LanguageJavaMIT LicenseMIT

Maven Central Build Status codecov

Broken XML is a parser that can parse any broken and invalid XML. This parser should not be used by any normal human being (who can use these parsers). But if you're lucky like myself, just read further...

Add via Maven

<dependency>
  <groupId>com.guseyn.broken-xml</groupId>
  <artifactId>broken-xml</artifactId>
  <version>${broken-xml.last-version}</version>
</dependency>

Build from sources

mvn clean package -Plocal

Jar file is target/broken-xml-<version>.jar.

Or you can just install last version of jar file in the releases section.

API

Broken XML works only with a simple String on input:

public class Main {
    static void main(String[] args){ 
        XmlDocument document = new ParsedXML("<xml>...</xml>").document();
        // You can get list of head elements, if for some reason you have several of them
        List<HeadElement> heads = document.heads(); 
        // You can get multiple roots
        List<Element> roots = document.roots();
        // You can even get comments in your xml
        List<Comment> comments = document.comments();
    }
}

Structure

XmlDocument

XmlDocument is what you get by calling new ParsedXML(xmlAsString).document().

XmlDocument document = new ParsedXML(xmlAsString).document();
// Components:
List<HeadElement> heads = document.heads();
List<Element> roots = document.roots();
List<Comment> comments = document.comments();
int start = document.start(); // is always 0
int end = document.end(); // is always a length of xml string
HeadElement

HeadElement represents head of XML. It's an element that looks like <?xml ... ?>.

XmlDocument document = new ParsedXML(xmlAsString).document();
HeadElement head = document.heads().get(0);
// Components:
List<Attribute> attributes = head.attributes();
int start = head.start();
int end = head.end();
Element

Element can be either a root or just a child node in xml.

XmlDocument document = new ParsedXML(xmlAsString).document();
Element element = document.roots().get(0); // can be aslo retrieved from another element via children() method
// Components:
String name = element.name();
List<Attribute> attributes = element.attributes();
List<Element> children = element.children();
List<Text> texts = element.texts();
int start = element.start();
int end = element.end();
Attribute

Attribute can be either a component of HeadElement or Element.

XmlDocument document = new ParsedXML(xmlAsString).document();
Element element = document.roots().get(0);
Attribute attribute = element.attributes().get(0); 
// Components:
String name = attribute.name();
String value = attribute.value();
int nameStart = attribute.nameStart();
int nameEnd = attribute.nameEnd();
int valueStart = attribute.valueStart();
int valueEnd = attribute.valueEnd();
Text

Text is a component of Element.

XmlDocument document = new ParsedXML(xmlAsString).document();
HeadElement element = document.heads().get(0);
Element element = document.roots().get(0)
Text text = element.texts().get(0) 
// Components:
String value = text.value();
int start = text.start();
int end = text.end();
Comment

Comment is a component of XmlDocument.

XmlDocument document = new ParsedXML(xmlAsString).document();
Comment comment = document.comments().get(0);
// Components:
String text = comment.text();
int start = comment.start();
int end = comment.end();

How broken is your XML?

Empty xml

If you have an empty xml, no problem, you'll get just empty XmlDocument:

public class EmptyXmlTest {
    @Test
    public void test() {
        final ParsedXML xml = new ParsedXML("");
        XmlDocument document = xml.document();
        assertEquals(document.heads().size(), 0);
        assertEquals(document.roots().size(), 0);
        assertEquals(document.start(), 0);
        assertEquals(document.end(), 0);
    }
}
XML that is wrapped with some other text

Broken XML allows you to have xml text with no XML stuff, in such case it will return information only about XML part:

public class NoXmlAroundXmlTest {
    @Test
    public void test() {
        final ParsedXML xml = new ParsedXML("Some text here<root attr=\"value\">text</root>and some text here");
        XmlDocument document = xml.document();
        assertEquals(document.roots().size(), 1);
        assertEquals(document.roots().get(0).name(), "root");
        assertEquals(document.roots().get(0).texts().get(0).value(), "text");
    }
}
Multiple roots

Valid xml contains only one root element. But Broken XML does not care and returns multiple roots as a list:

public class MultipleRootsTest {
    @Test
    public void test() {
        final ParsedXML xml = new ParsedXML("<root1></root1><root2></root2>");
        XmlDocument document = xml.document();
        assertEquals(document.roots().size(), 2);
        assertEquals(document.roots().get(0).name(), "root1");
        assertEquals(document.roots().get(1).name(), "root2");
    }
}
Duplicate attributes in elements

It does not matter anymore if elements in your xml have duplicate attribute names, Broken XML will return a list of them:

public class DuplicateAttributesInElementTest {
    @Test
    public void test() {
        final ParsedXML xml = new ParsedXML("<elm attr=\"value1\" attr=\"value2\"></elm>");
        XmlDocument document = xml.document();
        Element element = document.roots().get(0);
        assertEquals(element.attributes().size(), 2);
        assertEquals(element.attributes().get(0).name(), "attr");
        assertEquals(element.attributes().get(0).value(), "value1");
        assertEquals(element.attributes().get(1).name(), "attr");
        assertEquals(element.attributes().get(1).value(), "value2");
    }
}
Some tags are not closed

You can have xml with unclosed tags:

<root>
  <elm1 attr="value">
    text
  </elm1>
  <elm2 attr="value" attr="value">text
</root>

That's fine, Broken xml parses such things:

public class SomeTagsAreNotClosedTest {
    @Test
    void test() {
        final ParsedXML xml = new ParsedXML(xmlFromFileAsString);
        XmlDocument document = xml.document();
        assertEquals(document.roots().size(), 1);
        assertEquals(document.roots().get(0).children().size(), 2);
        assertEquals(document.roots().get(0).children().get(1).name(), "elm2");
        assertEquals(document.roots().get(0).children().get(1).texts().get(0).value(), "text\n");
        assertEquals(document.roots().get(0).children().get(1).texts().get(0).end(), 86);
        assertEquals(document.roots().get(0).children().get(1).end(), 86);
    }
}
No closing tags at all

Who needs closing tags anyway, right?

<root>
  <elm1 attr="value" attr="value">
    <elm2 attr="value" attr="value">
      <elm3 attr="value" attr="value">
        <elm4 attr="value" attr="value">
          <elm5 attr="value" attr="value">
            <elm6 attr="value" attr="value">text

That's fine, Broken xml parses even such things:

public class NoClosingTagsAtAllTest {
    @Test
    void test() {
        final ParsedXML xml = new ParsedXML(xmlFromFileAsString);
        XmlDocument document = xml.document();
        assertEquals(document.roots().get(0).children().size(), 1);
        assertEquals(document.roots().get(0).children().get(0).name(), "elm1");
        assertEquals(document.roots().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).attributes().get(0).value(), "value");
        assertEquals(document.roots().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).attributes().get(0).value(), "value");

        assertEquals(document.roots().get(0).children().get(0).children().size(), 1);
        assertEquals(document.roots().get(0).children().get(0).children().get(0).name(), "elm2");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");

        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().size(), 1);
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).name(), "elm3");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");

        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().size(), 1);
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).name(), "elm4");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");

        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().size(), 1);
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).name(), "elm5");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");

        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().size(), 1);
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).name(), "elm6");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).attributes().get(0).value(), "value");
        assertEquals(document.roots().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).children().get(0).texts().get(0).value(), "text");
    }
}
Swapped opening and closing tags

Obviously Broken XML does not care if names in opening and closing tags of elements match:

<elm1>
  <elm2>text</elm1>
</elm2>

Broken XML can easily eat such stuff:

public class SwappedOpeningAndClosingTags {
    @Test
    public void test() {
        final ParsedXML xml = new ParsedXML(xmlFromFileAsString);
        XmlDocument document = xml.document();
        assertEquals(document.roots().size(), 1);
        assertEquals(document.roots().get(0).children().size(), 1);
        assertEquals(document.roots().get(0).name(), "elm1");
        assertEquals(document.roots().get(0).children().get(0).name(), "elm2");
        assertEquals(document.roots().get(0).children().get(0).texts().get(0).value(), "text");
    }
}
Non-escaped brackets inside of elements

Broken XML can handle brackets <, > inside of elements if they are not really part of element tags:

<elm1>
  <><<
  <elm2><><< some text<><< other text</elm2>
</elm1>

It will be parsed with no problems:

public class NonEscapedBracketsInTexts extends XmlSource {
    @Test
    @Override
    void test() throws IOException {
        final ParsedXML xml = new ParsedXML(xmlFromFileAsString);
        XmlDocument document = xml.document();
        assertEquals(document.roots().get(0).name(), "elm1");
        assertEquals(document.roots().get(0).texts().get(0).value(), "\n  <><<\n  ");
        assertEquals(document.roots().get(0).children().get(0).name(), "elm2");
        assertEquals(document.roots().get(0).children().get(0).texts().get(0).value(), "<><< some text<><< other text");
    }
}

Important note: this works if only bracket < are not followed by any valid element name symbol, otherwise it's impossible even for Broken XML

Non-escaped quotes in attribute values

Guess what else Broken XML can do. You don't have to escape quotes anymore:

<elm attr=""va""lu""e">

</elm>

It will be parsed with no problems:

public class NonEscapedQuotesTest extends XmlSource {
    @Test
    @Override
    void test() throws IOException {
        final ParsedXML xml = new ParsedXML(xmlFromFileAsString);
        XmlDocument document = xml.document();
        assertEquals(document.roots().size(), 1);
        assertEquals(document.roots().get(0).name(), "elm");
        assertEquals(document.roots().get(0).attributes().get(0).name(), "attr");
        assertEquals(document.roots().get(0).attributes().get(0).value(), "\"va\"\"lu\"\"e");
    }
}

Important note: it works only if non-escaped quotes are not followed by space or > symbol (remember it's impossible to read your mind, luckily for you).

Impossible even for Broken XML

In this section everything will be parsed without any errors but not in the way that you'd expect, because Broken XML should also be able to parse valid XML as well. And there are some exceptional situations where it's impossible to predict what exactly is needed to parse. So, basically, following cases are not resolvable even theoretically and you have to remember what you'll get from the parser if they happens.

Different types of opening and closing quotes for attribute values

Sorry, but if you will have something like this:

<root attr1='value1">
  text1
</root>
<root attr2="value2'>
text2
</root>

it will parsed like an element with attribute that has value which is xml-like text:

public class DifferentTypesOfOpeningAndClosingQuotesForAttributeValuesTest {
    @Test
    public void test() {
        final ParsedXML xml = new ParsedXML(xmlFromFileAsString);
        XmlDocument document = xml.document();
        assertEquals(document.start(), 0);
        assertEquals(document.end(), 73);
        assertEquals(document.roots().size(), 1);
        assertEquals(document.roots().get(0).attributes().size(), 1);
        assertEquals(document.roots().get(0).attributes().get(0).name(), "attr1");
        assertEquals(document.roots().get(0).attributes().get(0).value(), "value1\">\n  text1\n<root>\n<root attr2\"value2");
    }
}
Open bracket is follwed by valid element name symbol

You can use non-escaped brackets, but if open bracket < is followed by any valid name symbol:

<elm><><<sometext
</elm>

Then it will be parsed as part of new tag:

public class OpenBracketIsFollowedByElementNameSymbolTest {
    @Test
    void test() throws IOException {
        final ParsedXML xml = new ParsedXML(xmlFromFileAsString);
        XmlDocument document = xml.document();
        assertEquals(document.roots().get(0).name(), "elm");
        assertEquals(document.roots().get(0).children().get(0).name(), "sometext");
    }
}
No closing tags with mutiple roots

Let's say you have following invalid xml:

<root1>
  <elm></elm>
</root1>
<root2>
  <elm></elm>
</root2>
<root3>
  <elm1 attr="value" attr="value">
    <elm2 attr="value" attr="value">
      <elm3 attr="value" attr="value">
        <elm4 attr="value" attr="value">
          <elm5 attr="value" attr="value">
            <elm6 attr="value" attr="value">text

Broken XML in such case assumes that you closed <root2> prematurely and will add <root3> as child element to <root2>.

Why? Well, imagine you have just one root that is not closed, do you really want to create another root for unclosed elements? Or let's say you don't have closed elements in your root(which is closed), we don't really want to create a root for non-closed elements which are in our root, right? So, in another words we will have logical errors in our parser if we do otherwise, and technically it's impossible to detect such tiny things in xml format.

Just remember what you'll get in such exceptional situations. And for God's sake just fix your XMLs.

Non-closed comment

If you forgot to close comment:

<elm>
  <!--sfsef
</elm>

then sorry, but everything till the end will be parsed as a comment(but will be parsed anyway!):

public class NonClosedCommentTest extends XmlSource {
    @Test
    @Override
    void test() throws IOException {
        final ParsedXML xml = new ParsedXML(
            dataByPath("non-closed-comment.xml")
        );
        XmlDocument document = xml.document();
        assertEquals(document.roots().size(), 1);
        assertEquals(document.comments().size(), 1);
        assertEquals(document.comments().get(0).text(), "sfsef\n<elm>");
    }
}

What about CDATA

Due to different technical reasons it's decided that it's better to parse <![CDATA[...]]> as text inside of element. If <![CDATA[...]]> is outside of element scope, then it just will not be parsed (like in any normal xml parser).

Running checkstyle

mvn checkstyle:checkstyle