google/gumbo-parser

invalid attribute names should be ignored

vinc17fr opened this issue · 16 comments

Invalid attribute names should be ignored. This includes unexpected or duplicate quote mark. For instance, consider the following cases: <body "> and <body !=foo>. With the callback (used via the HTML::Gumbo Perl module), events associated with these invalid attribute names are generated, while no attributes should be generated in such a case. Compare with what HTML Tidy does.

In practice, I got the error when I wanted to parse the Inria home page, which currently contains: <body class="path-frontpage has-glyphicons noTouchDevice"">

Compare with what HTML Tidy does.

Compare with what Chrome, Firefox, etc. do.

A general purpose parsing library should follow the specification, not aim to replicate some other tool's behaviour.

Chrome, Firefox, etc. are third-party browsers. HTML Tidy comes from the W3C and is meant to follow the specifications. That's why I've mentioned it.

The relevant parts of the spec are:

  • §12.2.5.39 "After attribute value (quoted) state" ("anything else" clause)
  • §12.2.5.32 "Before attribute name state" ("anything else" clause)
  • §12.2.5.33 "Attribute name state" ("quotation mark" clause)

From what I can see, Gumbo is doing exactly what the specification requires in this case.

The HTML Tidy behaviour may actually be useful, since it's purpose is to "tidy" bad HTML, but it's not "following the specifications" here like a user agent is supposed to.

By the way, I'm just commenting as an outside party. This project has been more or less abandoned for ~4 years.

In any case, gumbo generates an invalid attribute name: 12.1.2.3 Attributes. This is not correct.

In any case, gumbo generates an invalid attribute name: 12.1.2.3 Attributes.

See the note above for §12.1:

§12.1 Writing HTML documents

This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").

Requirements for authors and requirements for parsers are not the same thing. The spec encourages authors to write "well formed" documents. However, parsers have to be much more lenient in what they accept, due to a massive legacy of existing web content.

This is not correct.

Following the steps given for parsers, it seems quite correct to me.

There are GumboError and GumboParseFlags types that are recorded during parsing. They should encode the various errors and unusual conditions mentioned in the spec. There's sufficient information encoded in the parse tree to determine which nodes were ill-formed, parser-inserted, etc. but it takes some extra effort to use it.

Invalid attribute names make the XML::LibXML Perl module crash. There needs to be some method to convert HTML5 to XHTML5 without requiring the converter to perform additional well-formedness checking. I'm using the HTML::Gumbo Perl module (with the callback method), and information about the errors is not available there. Perhaps the issue is in this module, then.

if what you want is xhtml from html5 and want info on errors, there are other forks of gumbo that are actively maintained, that will accept xhtml and html inputs, serialize a parse tree back to xhtml, and provide parse error info.

One is used in the epub editor and one is used in Calibre. See here for the Sigil fork (I maintain it and will respond to error reports)

https://github.com/Sigil-Ebook/sigil-gumbo

and Calibre's is here: ( Kovidgoyal maintains it and does respond to bug reports as well).

https://github.com/kovidgoyal/html5-parser

and both of these have borrowed heavily from the many many bug fixes and developments/speedups done by many of the people from this site.

If it helps, both of the above have python3 bindings and work well with bs4 and are actively used in epub related projects.

FWIW, trying a callback approach on parsing non-xml, non-strict html with an html5 spec compliant parser is going to be a pain as the spec calls for specific errors, injected nodes, reordering and reparenting and even resequencing poorly nested html, etc.

You may want to post process the already built tree. This is how the bindings to python do it. See Calibre's implementation that binds to python for example.

I've opened a Debian bug to ask for a replacement by sigil-gumbo.

With the HTML::Gumbo Perl module, the tree format is in alpha stage. That's why I did not choose this method.

You may want to modify that to choose the html5-parser as it will show a better way to bind to a language like Perl (as it is doing the same for Python).

Since Sigil is a C++ project, we use the gumbo built tree directly and added our own post build simple tree modification ability and then wrappered it in a C++ class to make all of the higher level functionality typically found in a DOM library.

You may want to modify that to choose the html5-parser

I was under the impression that html5-parser only offers a Python API? Do they support building and installing libgumbo for library use too?

It seems to me like none of the half dozen or so forks actually support being linked as a shared library. I don't think there's anything that can really replace a libgumbo distro package, unless I'm mistaken about html5-parser.

I'm not sure about the details of this particular use case. I just recall seeing html5-parser described as "... HTML 5 parsing for python", so presumably they don't support linking to their internal copy of Gumbo as a shared library. Of course, that's not to say it couldn't be taken and adapted. It just doesn't seem like an out-of-the-box solution, if that's what's being sought.

Using a callback approach on top of gumbo is just not going to work easily to create nodes on the fly for processing non-strict html unless I am missing something.

Yeah, a callback approach with Gumbo makes no sense whatsoever. The Hubbub library offers a streaming/callback interface, but IMO the API is horrible compared to Gumbo (albeit for legitimate reasons).

Good point. What about your own fork?

There must be 3 or 4 active forks out there all doing something a bit different. It is a real shame they did not pass ownership of this project to someone to keep it alive or at least pull in the bug fixes that people have posted about here.

What about your own fork?

I forked the library because it seemed like the best route to fixing the bugs that affected lua-gumbo. I'd probably contribute to a proper fork of the library, if one were to emerge, but I don't really want to lead such an effort.