Inconsistency between `node.OuterHtml`, `node.EndNode.OuterHtml`, and `node.Closed`
rvishruth opened this issue · 6 comments
1. Description
Inconsistency between node.OuterHtml
, node.EndNode.OuterHtml
, and node.Closed
HtmlAgilityPack automatically closes certain nodes. However, when it does, the EndNode
does not get updated accordingly.
Related follow-up questions:
- Is it possible to disable this auto-closing behavior?
- If not, how can one detect if any changes were made to the original HTML?
- Why are there no ParseErrors being logged?
2. Exception
N/A
3. Fiddle or Project
https://dotnetfiddle.net/wDNJfT
From output,
p, <p>Hello, , <>, True
- Endnode is <>
and the <p>
tag is not closed
strong, <strong>world!</strong>, <>, True
- EndNode is <>
even though the OuterHTML has a closing tag
Hello @rvishruth ,
1. Is it possible to disable this auto-closing behavior?
No, it's currently not possible. However, it look to be easily possible by adding 2 simple options:
DisableImplicitEndLogic
DisableExplicitEndLogic
The <>
is usually when there is no EndNote
. There is indeed no implicit end in the case of an implicit end.
We can surely add those options, but then you will have some side impact as some of your tag is not closed. So you will probably not get the desired behavior even with those options.
2. If not, how can one detect if any changes were made to the original HTML?
Besides comparing the HTML, I don't think there is anyway.
3. Why are there no ParseErrors being logged?
Not all parsing errors is logged. Implicit ending like <p>a<p>b
is not a parsing error but a valid HTML syntax, so therefore, it's not logged.
Let me know if that answer correctly to your question.
Best Regards,
Jon
Hi @JonathanMagnan! Thank you for your response!
(2) and (3) makes sense to me!
Regarding (1), could you expand on what you mean by
The <> is usually when there is no EndNote. There is indeed no implicit end in the case of an implicit end.
Is this not a bug (because the Endnode is <> even though the there was a closing tag added)?
Hello @rvishruth ,
Is this not a bug (because the Endnode is <> even though the there was a closing tag added)?
Indeed, it can be seen as a bug. I would have expected a null value for the EndNode
as there is no EndNode
. I would not have expected the OuterHtml
to either be <>
but empty instead.
But we didn't think about it years ago when we added the ImplicitEnd
logic, and modifying this part of the code would cause some breaking change impact for other developers, so that's not something we are currently comfortable fixing.
Best Regards,
Jon
I understand, thanks @JonathanMagnan! Few more follow up questions:
Indeed, it can be seen as a bug. I would have expected a null value for the EndNode as there is no EndNode. I would not have expected the OuterHtml to either be <> but empty instead.
This makes sense for the p tag case, where there was no closing tag added. But for the strong tag case, HAP adds a closing tag, so, I'm not sure if a future change will make it so that the EndNode
will be </strong>
. With this in mind, what would be the way to detect if the original HTML had any improperly closed tags present? Would it be to iterate through every node and check if the EndNode == "<>"
?
- Why is
node.Closed
True
forp, <p>Hello, , <>, True
(see fiddle)?
Hello @rvishruth ,
Just to let you know, I'm not the original creator of this library (we purchased it as it was no longer maintained), so what I'm saying is what I understand.
The strong
tag is automatically closed because the p
tag just before gets closed (so everything inside the p
tag needs to be closed.
Unlike a p
tag where the end tag is optional (See Tag omission
section here, the strong
requires an end tag (see Tag omission
here.
So HAP automatically added the end tag when writing the HTML, but when he parsed the HTML, there was indeed no end tag, so no EndNode
. Is it a bug? Is it a normal behavior? At this point, I have no idea. I can just explain what is happening.
I'm not really sure what the best way to achieve your actual behavior would be. HAP is not 100% HTML compatible so you will surely get some weird behavior especially with invalid HTML.
Thank you for the explanation @JonathanMagnan! I Given the current functionality, I think checking if EndNode is <> might be the best way to detect if there was a closing tag missing (apart from comparing with the original HTML).
I leave it up to you if you want to keep the issue open. I do think the current behavior is inconsistent, but its also unclear if its a bug.