posthtml/posthtml-render

Changing invalid markup parsing behavior

Treycos opened this issue · 5 comments

Hi, I'm trying to parse XML files from a forum that may contain invalid matching tags.

A simple example of what I have to process:

<a>
  <b>
</a>
    Text
</b>

Since the closing b tag isn't found before the closing a tag, the posthtml-parser algorithm handles it by moving the closing b tag upward in the tree:

<a>
  <b>
  </b>
</a>
    Text

However, the data is supposed to be understood in the following way:

<a>
  <b>
    Text
  </b>
</a>

That is, instead of moving the closing b tag upward, the closing a tag should be moved downward, to the first spot where it makes sense.
Is there an option within the parser to make it handle mismatched tags this way?

Thank you for your help

Scrum commented

@Treycos Hi, use options singletags

I'm not sure I understand how I should use it to solve the problem I described.

The end goal is to convert the corrected XML into JSON, which is why I posted it in the posthtml-parser repo.

Scrum commented

I checked your example on xmlvalidation and it turned out to be invalid, because I got an error:

3: | 3 | The element type "b" must be terminated by the matching end-tag "</b>".

So I concluded that you may have made a typo and that the XML was meant to look like this:

<a>
  <b>
    Text
  </b>
</a>

posthtml-parser parsed it correctly, built the AST, and smoothed out your mistakes.

@Treycos Hi, use options singletags

Yes, I did not immediately see the second closing element </b> and thought that maybe you wanted this element to be treated as a single tag.

Yeah, the given XML is invalid. I do not have any control over the API providing the file, though. The system that originally handles it parses the invalid syntax into the following:

<a>
  <b>
    Text
  </b>
</a>

I'm trying to find a way to handle the file in the same way as the old system does.
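
For illustration, the "move the closing a tag downward" behavior described above could be approximated by a small preprocessing step that runs before the markup is handed to any parser. The sketch below is only an assumption about how the old system might behave, not an option of posthtml-parser or posthtml-render: repairCrossedTags is a hypothetical helper name, and the regex tokenizer is deliberately naive (it does not understand comments, CDATA sections, or a ">" inside an attribute value).

```javascript
// Hypothetical helper, not part of posthtml-parser or posthtml-render.
// It repairs crossed tags by deferring a closing tag that arrives too early
// until every element opened inside it has been closed.
function repairCrossedTags(source) {
  // Matches a tag (capturing its name), a run of text, or a stray "<".
  const tokenPattern = /<\/?([A-Za-z][\w:.-]*)[^>]*>|[^<]+|</g;
  const open = []; // stack of { name, pendingClose }
  let output = '';

  for (const match of source.matchAll(tokenPattern)) {
    const token = match[0];
    const name = match[1];

    // Text and self-closing tags pass through unchanged.
    if (name === undefined || token.endsWith('/>')) {
      output += token;
      continue;
    }

    // Opening tag: remember it and emit it.
    if (!token.startsWith('</')) {
      open.push({ name, pendingClose: false });
      output += token;
      continue;
    }

    // Closing tag: find the nearest open element with the same name.
    const index = open.map((element) => element.name).lastIndexOf(name);
    if (index === -1) {
      continue; // no matching opening tag: drop the stray closing tag
    }
    if (index < open.length - 1) {
      open[index].pendingClose = true; // closes too early: defer it
      continue;
    }

    // The closing tag matches the innermost open element.
    open.pop();
    output += token;

    // Emit deferred closing tags that have now become innermost.
    while (open.length > 0 && open[open.length - 1].pendingClose) {
      output += `</${open.pop().name}>`;
    }
  }

  // Close anything still open at the end of the input.
  while (open.length > 0) {
    output += `</${open.pop().name}>`;
  }
  return output;
}

const repaired = repairCrossedTags('<a>\n  <b>\n</a>\n    Text\n</b>\n');
console.log(repaired.replace(/\s+/g, '')); // prints "<a><b>Text</b></a>"
```

The repaired string nests Text inside b and b inside a, so it can then be passed to a regular parser. Original whitespace is kept as-is, so the indentation of the output will not match the hand-formatted example.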

Scrum commented

Do not hesitate to reopen the issue if you still have questions.