zzzprojects/html-agility-pack

LoadHTML leads to wrong result when node begins with underscore

mausoma opened this issue · 2 comments

1. Description

var html = @"<_links>A</_links>";

var htmlInput = new HtmlAgilityPack.HtmlDocument();
htmlInput.LoadHtml(html);

Console.WriteLine(html);
Console.WriteLine(htmlInput.DocumentNode.OuterHtml);

LoadHTML doesn't load the XML correctly. The output is:
<_links>A</_links>
<_links>A

2. Exception

I would expect the input and OuterHtml to be (more or less) the same.
But the </_links> end tag is missing completely.

4. Any further technical details

  • HAP version: 1.11.54
  • .NET Framework 4.7.2

Hello @mausoma ,

HtmlAgilityPack is an HTML parser. A tag in HTML cannot start with an underscore.

Best Regards,

Jon

In addition to what Jonathan already pointed out, I want to add that it is possible to check whether the input data was parsed without errors. To do so, inspect HtmlDocument.ParseErrors after loading the input data.

In your case, HtmlDocument.ParseErrors should contain an error relating to the invalid <_links> element. Unfortunately, the reported error is a bit misleading: it says "Start tag <_links> was not found" rather than stating that "_links" is an invalid element name. :-(
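For illustration, here is a minimal sketch of that check, using the same input as the original report. It assumes the HtmlAgilityPack NuGet package is referenced; the Program class and the message formatting are only illustrative, and the exact errors reported may differ between HAP versions.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<_links>A</_links>");

        // ParseErrors is filled in while the input is loaded;
        // an empty sequence means the parser reported no problems.
        var errors = doc.ParseErrors.ToList();

        if (errors.Count == 0)
        {
            Console.WriteLine("No parse errors reported.");
            return;
        }

        foreach (var error in errors)
        {
            // Code, Line, LinePosition and Reason are properties of HtmlParseError.
            Console.WriteLine(
                $"{error.Code} at line {error.Line}, column {error.LinePosition}: {error.Reason}");
        }
    }
}
```

Treating any reported error as a signal that the round trip through OuterHtml may not match the input is a cautious default; whether a particular error matters depends on the document being parsed.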

(P.S.: I am just a user and not affiliated with the HAP project or its maintainers.)