Changing invalid markup parsing behavior
Treycos opened this issue · 5 comments
Hi, I'm trying to parse XML files from a forum that may contain invalid matching tags.
A simple example of what I have to process would be the following syntax:
<a>
<b>
</a>
Text
</b>
Since the </b> closing tag isn't found before </a>, the posthtml-parser algorithm handles it by bringing the closing </b> tag upward in the tree:
<a>
<b>
</b>
</a>
Text
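The output above is consistent with a common stack-based recovery rule: when a closing tag matches an ancestor rather than the innermost open element, every element opened since that ancestor is closed first, and a closing tag with no open match is dropped. The following is a minimal sketch of that rule only, not posthtml-parser's actual code; `closeEagerly` is a hypothetical name, and the regex tokenizer assumes plain tags and text without comments, CDATA, or `>` inside attribute values.

```typescript
// Hypothetical helper illustrating the "close inner elements early" recovery rule.
// Assumption: input contains only simple tags and text (no comments, CDATA, or '>' in attributes).
function closeEagerly(input: string): string {
  const tokens = input.match(/<\/?[A-Za-z][^>]*>|[^<]+/g) || [];
  const open: string[] = []; // names of currently open elements, innermost last
  const out: string[] = [];

  for (const token of tokens) {
    const close = /^<\/([A-Za-z][\w:-]*)\s*>$/.exec(token);
    const start = /^<([A-Za-z][\w:-]*)[^>]*>$/.exec(token);

    if (close) {
      const index = open.lastIndexOf(close[1]);
      if (index === -1) continue; // closing tag with no open element: dropped
      // Close everything opened since the matching element, then the element itself.
      while (open.length > index) out.push(`</${open.pop()}>`);
    } else {
      // Opening tags are tracked unless they are self-closing (<x/>); text passes through.
      if (start && token.slice(-2) !== "/>") open.push(start[1]);
      out.push(token);
    }
  }

  // Close anything still open at the end of the input.
  while (open.length > 0) out.push(`</${open.pop()}>`);
  return out.join("");
}

console.log(closeEagerly("<a><b></a>Text</b>")); // <a><b></b></a>Text
```

With this rule the early </a> forces </b> to be emitted before it, and the later </b> is discarded because nothing named b is open any more.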
However, the data is supposed to be understood in the following way:
<a>
<b>
Text
</b>
</a>
Instead of bringing the </b> closing tag upward, the </a> closing tag is brought downward, to the first spot where it makes sense.
Is there an option within the parser to make it handle mismatched tags this way?
Thank you for your help
@Treycos Hi, use options singletags
I'm not sure I understand how I should use it to solve the problem I described.
The end goal is to convert the corrected XML into JSON, which is why I posted this in the posthtml-parser repo.
I checked your example on xmlvalidation and it turned out to be invalid, because I got this error:
3: | 3 | The element type "b" must be terminated by the matching end-tag "</b>".
So I concluded that you may have made a typo, and that the XML was meant to look like this:
<a>
<b>
Text
</b>
</a>
posthtml-parser parsed it correctly, built the AST, and smoothed out your mistakes.
@Treycos Hi, use options singletags
Yes, I did not immediately see the second closing element </b> and thought that you might want this element to be treated as a single tag.
Yeah, the given XML is invalid. I do not have any control over the API providing the file, though. The system that originally handles it parses the invalid syntax into the following:
<a>
<b>
Text
</b>
</a>
I'm trying to find a way to handle the file in the same way as the old system does.
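One way to approximate the old system's behaviour without changing the parser is to repair the markup in a pre-processing pass and only then hand the result to posthtml-parser. The sketch below is an assumption about how such a pass could work, not an existing parser option: `deferMismatchedClosers` is a hypothetical name, and the regex tokenizer assumes plain tags and text without comments, CDATA, or `>` inside attribute values. A closing tag that arrives while inner elements are still open is held back and emitted right after those inner elements close.

```typescript
// Hypothetical pre-processing pass: move premature closing tags downward
// to the first position where they are well-formed.
// Assumption: input contains only simple tags and text (no comments, CDATA, or '>' in attributes).
function deferMismatchedClosers(input: string): string {
  const tokens = input.match(/<\/?[A-Za-z][^>]*>|[^<]+/g) || [];
  const open: string[] = [];     // names of currently open elements, innermost last
  const deferred: string[] = []; // closing tags seen too early, waiting for inner elements
  const out: string[] = [];

  for (const token of tokens) {
    const close = /^<\/([A-Za-z][\w:-]*)\s*>$/.exec(token);
    const start = /^<([A-Za-z][\w:-]*)[^>]*>$/.exec(token);

    if (close) {
      const name = close[1];
      if (open.length > 0 && open[open.length - 1] === name) {
        // Normal case: the closing tag matches the innermost open element.
        open.pop();
        out.push(token);
        // Emit any held-back closing tags that have now become valid.
        while (open.length > 0 && deferred.indexOf(open[open.length - 1]) !== -1) {
          const pending = open.pop() as string;
          deferred.splice(deferred.indexOf(pending), 1);
          out.push(`</${pending}>`);
        }
      } else if (open.indexOf(name) !== -1) {
        deferred.push(name); // ancestor closed too early: hold it back
      }
      // Otherwise the closing tag matches nothing open and is dropped.
    } else {
      // Opening tags are tracked unless they are self-closing (<x/>); text passes through.
      if (start && token.slice(-2) !== "/>") open.push(start[1]);
      out.push(token);
    }
  }

  // Close anything still open at the end of the input.
  while (open.length > 0) out.push(`</${open.pop()}>`);
  return out.join("");
}

console.log(deferMismatchedClosers("<a><b></a>Text</b>")); // <a><b>Text</b></a>
```

Well-formed input passes through unchanged, so the pass could be applied to every file before parsing. Inputs that reuse the same tag name at several nesting levels may need a more careful matching rule than this sketch provides.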
Do not hesitate to reopen the issue if you still have questions.