zzzprojects/html-agility-pack

HtmlDocument shows `<link>foo</link>` tag as just `<link>foo`

Description

If I pass a raw string that contains <link>foo</link> to htmlDocument.LoadHtml(raw),
then output htmlDocument.DocumentNode.OuterHtml, it will show up as <link>foo (without the closing tag).

Similarly, if I set htmlDocument.OptionWriteEmptyNodes = true, the output will be <link />foo, which perhaps indicates that it thinks <link> is an empty node?

Note: my input is not necessarily a web page. I know <link> has a special meaning in HTML, but I'd still like to be able to load it as a regular node.

Fiddle

https://dotnetfiddle.net/QASHg5

The <link> element in HTML is a void element: it does not support any content apart from attributes and therefore has no end tag (see the HTML specification). HAP, being an HTML parser, tries to parse it as a regular HTML <link> element, which is why you get the output you see.

Looking around a bit in HAP's source code, there seems to be a way to achieve what you want. The HtmlAgilityPack.HtmlNode class maintains a static dictionary, HtmlNode.ElementsFlags, that maps element names to element characteristics. For the link element name, the dictionary marks the element as empty (HtmlElementFlag.Empty).

Since HtmlNode.ElementsFlags is publicly accessible, it is sufficient to remove the entry for link from this dictionary to get the desired result:

// Stop HAP from treating <link> as an empty element.
HtmlNode.ElementsFlags.Remove("link");

var html = @"<root><link>foo</link><url>bar</url></root>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
...

Note that due to HtmlNode.ElementsFlags being a static field, modifying or replacing its assigned dictionary will affect all parsing done by HAP in your application.
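
If the global side effect is a concern, one possible approach is to remove the entry only for the duration of a parse and put it back afterwards. The following is a minimal sketch, not something taken from HAP itself: the class name LinkAsRegularNode and the method name Parse are hypothetical, and it assumes no other thread is parsing with HAP at the same time, since the dictionary is shared and not synchronized.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class LinkAsRegularNode
{
    // Parses html with <link> treated as a regular element, then restores
    // the original HtmlNode.ElementsFlags entry for "link" (if there was one).
    // Assumption: no concurrent HAP parsing happens elsewhere in the process.
    public static string Parse(string html)
    {
        // Remember the current flag so it can be put back afterwards.
        bool hadFlag = HtmlNode.ElementsFlags.TryGetValue("link", out HtmlElementFlag saved);
        HtmlNode.ElementsFlags.Remove("link");
        try
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            // Serialize while the flag is still removed.
            return doc.DocumentNode.OuterHtml;
        }
        finally
        {
            // Restore the default behavior for the rest of the application.
            if (hadFlag)
            {
                HtmlNode.ElementsFlags["link"] = saved;
            }
        }
    }
}
```

Calling LinkAsRegularNode.Parse("<root><link>foo</link><url>bar</url></root>") should then keep the closing </link> tag, as described above, while leaving later parses in the application with the default handling of <link>.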

(P.S.: I am just a user of HAP and not associated with the project nor its authors/maintainers.)

Thank you @elgonzo for your help again. Your answer is 100% correct.

Let us know if you have any additional questions about this, @Davidsv.

Best Regards,

Jon

Perfect, this is good enough for me. Thank you both. Closing