parsing node content that contains xml/html-style tags

Question

Closed this issue 6 years ago · 1 comments

Given the following XML:

<xml>
  <node>some data</node>
  <italics>some <i>italic</i> data</italics>
</xml>

Parsing it, I get:

{
    "node": "some data",
    "italics": {
        "_Data": "some data",
        "i": "italic"
    }
}

I suppose this output makes sense, but is there some possibility to get it as follows with an option?

{
    "node": "some data",
    "italics": {
        "_Data": "some <i>italic</i> data"
    }
}

Of course, I could replace the <i> tags in the string before parsing (I don't need them preserved), but it would be neat to have an option to alter that behaviour if possible.

Answer 1 · 2018-08-20T05:36:40.000Z

Yeah, I do apologize, but this behavior is as designed. This library really isn't designed to parse HTML or HTML-like complex elements with mixed text and child elements at the same level in the hierarchy. See Issue #17 for a more detailed explanation.

There is really no easy way to fix this because of the way the library was designed: it uses nested recursive regular expressions. The only way to get what you want would be to pre-entitize the <i> and </i> into &lt;i&gt; and &lt;/i&gt; respectively before parsing.

I am sorry for this shortcoming, but the library was only designed to parse simple XML configuration files into a simple hash/array tree.
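
For anyone landing here later, the pre-entitizing workaround could look roughly like the sketch below. It is written in Python only to stay language-neutral; the same regular-expression replacement should port to whatever language hosts the parser. The function names (`entitize_inline_tags`, `strip_inline_tags`) and the `tag_names` parameter are made up for illustration and are not part of the library. It also assumes the parser decodes entities in text content, which is worth checking against your version.

```python
import html
import re


def build_inline_tag_pattern(tag_names):
    """Compile a regex matching opening, closing, and self-closing forms
    of the listed tag names only (hypothetical helper, not a library API)."""
    names = "|".join(re.escape(name) for name in tag_names)
    # After the name we require whitespace, "/" or ">", so a pattern for
    # "i" matches <i>, </i>, <i class="x"> and <i/> but not <italics>.
    return re.compile(r"</?(?:%s)(?:\s[^<>]*)?/?>" % names)


def entitize_inline_tags(xml_text, tag_names=("i",)):
    """Turn the listed inline tags into entities (&lt;i&gt; ...) so a
    parser sees them as plain text rather than child elements."""
    pattern = build_inline_tag_pattern(tag_names)
    return pattern.sub(
        lambda match: html.escape(match.group(0), quote=False), xml_text
    )


def strip_inline_tags(xml_text, tag_names=("i",)):
    """Alternative from the question: drop the listed inline tags
    entirely when they do not need to be preserved."""
    return build_inline_tag_pattern(tag_names).sub("", xml_text)


sample = "<italics>some <i>italic</i> data</italics>"
print(entitize_inline_tags(sample))
# <italics>some &lt;i&gt;italic&lt;/i&gt; data</italics>
print(strip_inline_tags(sample))
# <italics>some italic data</italics>
```

Listing the inline tag names explicitly, rather than escaping every tag, keeps the real document structure (<xml>, <node>, <italics>) intact for the parser. The sketch does not handle CDATA sections or comments that happen to contain those tags.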