Changing invalid markup parsing behavior

Question

Treycos opened this issue 4 years ago · 5 comments

Hi, I'm trying to parse XML files from a forum that may contain mismatched tags.

A simple example of what I have to process:

<a>
  <b>
</a>
    Text
</b>

Since the closing b tag isn't found before the closing a tag, the posthtml-parser algorithm handles it by bringing the closing b tag upward in the tree:

<a>
  <b>
  </b>
</a>
    Text

However, the data is supposed to be understood in the following way:

<a>
  <b>
    Text
  </b>
</a>

That is, instead of moving the closing b tag upward, the closing a tag should be moved downward, to the first spot where it makes sense.
Is there an option within the parser to make it handle mismatched tags this way?

Thank you for your help

Answer 1 · 2020-07-10T06:38:07.000Z

@Treycos Hi, use options singletags

Answer 2 · 2020-07-10T08:19:07.000Z

I'm not sure I understand how I should use it to solve the problem I described.

The end goal is to convert the corrected XML into JSON, which is why I posted it in the posthtml-parser repo.

Answer 3 · 2020-07-10T08:34:41.000Z

I checked your example on xmlvalidation and it turned out to be invalid; I got this error:

3: | 3 | The element type "b" must be terminated by the matching end-tag "</b>".

So I concluded that you may have made a typo and that the XML was meant to look like this:

<a>
  <b>
    Text
  </b>
</a>

posthtml-parser parsed it, built a correct AST, and smoothed out your mistakes.

Regarding my earlier reply ("@Treycos Hi, use options singletags"): yes, I did not immediately see the second closing element </b> and thought that maybe you wanted this element to be treated as a single tag.

Answer 4 · 2020-07-10T08:44:57.000Z

Yeah, the given XML is invalid. I do not have any control over the API providing the file, though. The system that originally handles it parses the invalid syntax into the following:

<a>
  <b>
    Text
  </b>
</a>

I'm trying to find a way to handle the file the same way the old system does.
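
One possible workaround, if the parser itself has no such option, is to repair the markup before parsing. Below is a minimal sketch of the "move the closing tag downward" rule described in the question. `deferMismatchedClosers` is a hypothetical helper, not part of posthtml-parser. It assumes simple tags: no `>` inside attribute values, and no tag-like text inside comments or CDATA.

```typescript
// Hypothetical pre-processing helper (not a posthtml-parser API).
// A closing tag that arrives while inner elements are still open is
// deferred, then emitted right after those inner elements are closed.
function deferMismatchedClosers(xml: string): string {
  const tagPattern = /<(\/?)([A-Za-z_][\w:.-]*)([^<>]*)>/g;
  const open: string[] = [];     // stack of currently open element names
  const deferred: string[] = []; // closers seen too early, waiting to be emitted
  let out = "";
  let last = 0;

  // Emit deferred closers for as long as the innermost open element has one.
  const flushDeferred = (): void => {
    while (open.length > 0) {
      const idx = deferred.lastIndexOf(open[open.length - 1]);
      if (idx === -1) break;
      deferred.splice(idx, 1);
      out += `</${open.pop()}>`;
    }
  };

  let m: RegExpExecArray | null;
  while ((m = tagPattern.exec(xml)) !== null) {
    const [raw, slash, name, rest] = m;
    out += xml.slice(last, m.index); // copy the text before this tag unchanged
    last = m.index + raw.length;

    if (slash === "") {
      out += raw;
      // Self-closing tags such as <br/> never go on the stack.
      if (!/\/\s*$/.test(rest)) open.push(name);
    } else if (open[open.length - 1] === name) {
      open.pop();
      out += raw;
      flushDeferred();
    } else if (open.indexOf(name) !== -1) {
      deferred.push(name); // closes an outer element: hold it back for now
    }
    // Closers that match no open element are dropped.
  }

  out += xml.slice(last);
  // Close anything still open at the end of the input.
  while (open.length > 0) out += `</${open.pop()}>`;
  return out;
}

const repaired = deferMismatchedClosers("<a><b></a>Text</b>");
console.log(repaired);
```

The repaired string could then be passed to posthtml-parser as usual. Whitespace is left where it was, so indentation around a moved closing tag may look uneven, and whether stray closing tags should be dropped rather than kept is a guess about what the old system does.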

Answer 5 · 2020-09-28T10:47:28.000Z

Do not hesitate to reopen the issue if you still have questions.