jamietre/HtmlParserSharp

Exception thrown if the html element contains an xmlns attribute

cmwoods opened this issue · 2 comments

I'm getting the following exception when the html element contains an xmlns attribute (as in an XHTML document):

Unhandled Exception: System.ArgumentException: The namespace declaration attribute has an incorrect 'namespaceURI': ''.
   at System.Xml.XmlDocument.AddAttrXmlName(String prefix, String localName, String namespaceURI, IXmlSchemaInfo schemaInfo)
   at System.Xml.XmlDocument.CreateAttribute(String prefix, String localName, String namespaceURI)
   at System.Xml.XmlElement.SetAttribute(String localName, String namespaceURI, String value)
   at HtmlParserSharp.XmlTreeBuilder.CreateHtmlElementSetAsRoot(HtmlAttributes attributes) in [...]\HtmlParserSharp\TreeBuilders\XmlTreeBuilder.cs:line 120
   at HtmlParserSharp.Core.TreeBuilder`1.AppendHtmlElementToDocumentAndPush(HtmlAttributes attributes) in [...]\HtmlParserSharp\Core\TreeBuilder.cs:line 5237
   at HtmlParserSharp.Core.TreeBuilder`1.StartTag(ElementName elementName, HtmlAttributes attributes, Boolean selfClosing) in [...]\HtmlParserSharp\Core\TreeBuilder.cs:line 2775
   at HtmlParserSharp.Core.Tokenizer.EmitCurrentTagToken(Boolean selfClosing, Int32 pos) in [...]\HtmlParserSharp\Core\Tokenizer.cs:line 1155
   at HtmlParserSharp.Core.Tokenizer.StateLoop(TokenizerState state, Char c, Int32 pos, Char[] buf, Boolean reconsume, TokenizerState returnState, Int32 endPos) in [...]\HtmlParserSharp\Core\Tokenizer.cs:line 2249
   at HtmlParserSharp.Core.Tokenizer.TokenizeBuffer(UTF16Buffer buffer) in [...]\HtmlParserSharp\Core\Tokenizer.cs:line 1382
   at HtmlParserSharp.SimpleHtmlParser.Tokenize(TextReader reader) in [...]\HtmlParserSharp\SimpleHtmlParser.cs:line 134
   at HtmlParserSharp.SimpleHtmlParser.Parse(TextReader reader) in [...]\HtmlParserSharp\SimpleHtmlParser.cs:line 63

It looks like the code does not expect XHTML input and therefore has no special case for this attribute: it is passed to XmlElement.SetAttribute with an empty namespace URI, which XmlDocument rejects for an attribute named xmlns.

As a workaround, I changed CreateHtmlElementSetAsRoot in XmlTreeBuilder.cs so that the body of the for loop reads:

                string uri = attributes.GetURI(i);
                if (attributes.GetLocalName(i) == "xmlns" && string.IsNullOrWhiteSpace(uri))
                {
                    uri = "http://www.w3.org/2000/xmlns/";
                }
                rv.SetAttribute(attributes.GetLocalName(i), uri, attributes.GetValue(i));

I don't know whether this is the correct fix, but it at least works around the issue.
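
For anyone who wants to check the behaviour without the parser, here is a minimal sketch that uses only System.Xml. The class and method names (XmlnsAttributeSketch, SetAttributeSafely, EmptyNamespaceThrows) are hypothetical and not part of HtmlParserSharp, and creating the root element in the XHTML namespace is an assumption made for the example, not necessarily what XmlTreeBuilder does.

```csharp
using System;
using System.Xml;

// Hypothetical helper class for illustration only; not part of HtmlParserSharp.
public static class XmlnsAttributeSketch
{
    // Namespace URI reserved for namespace declaration attributes.
    public const string XmlnsNamespace = "http://www.w3.org/2000/xmlns/";

    // Assumption: the root element lives in the XHTML namespace.
    public const string XhtmlNamespace = "http://www.w3.org/1999/xhtml";

    // Same substitution as the workaround above: an "xmlns" attribute that
    // arrives with an empty namespace URI is moved into the reserved
    // xmlns namespace before it is set on the element.
    public static void SetAttributeSafely(XmlElement element, string localName, string uri, string value)
    {
        if (localName == "xmlns" && string.IsNullOrWhiteSpace(uri))
        {
            uri = XmlnsNamespace;
        }
        element.SetAttribute(localName, uri, value);
    }

    // Returns true if setting "xmlns" with an empty namespace URI throws
    // ArgumentException, which is the failure shown in the stack trace.
    public static bool EmptyNamespaceThrows()
    {
        var doc = new XmlDocument();
        XmlElement html = doc.CreateElement("html", XhtmlNamespace);
        try
        {
            html.SetAttribute("xmlns", "", XhtmlNamespace);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

Calling SetAttributeSafely(html, "xmlns", "", "http://www.w3.org/1999/xhtml") on an element created as in EmptyNamespaceThrows should complete without the exception, and html.GetAttribute("xmlns") should then return the XHTML namespace string. The sketch does not serialize the document, so it says nothing about how the result behaves on Save.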