Support for non-canonical tags with html
Earnestly opened this issue · 5 comments
- Arch Linux
- 0.6.0 (linux amd64)
The definition of a non-canonical tag I am using here is essentially tags which close themselves, i.e. <foobar/>.
There seems to be an issue where fq -d html regards them as opening tags and nests everything that follows beneath them.
Consider a simple html document with a non-canonical element (typically seen as <link .../> tags). Here the foo tag seems to be regarded as the parent of the bar tag:
~ printf '<html><foo/><bar>baz</bar></html>' | fq -d html
{
  "html": {
    "body": {
      "foo": {
        "bar": "baz"
      }
    },
    "head": ""
  }
}
When forced to treat the input as xml, the output is closer to what I expected:
~ printf '<html><foo/><bar>baz</bar></html>' | fq -d xml
{
  "html": {
    "bar": "baz",
    "foo": ""
  }
}
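For comparison, the sibling structure reported by the xml decoder matches what a generic XML parser produces for the same input. A minimal sketch using Python's standard xml.etree.ElementTree (Python is an assumption chosen for illustration; fq itself is written in Go):

```python
import xml.etree.ElementTree as ET

doc = "<html><foo/><bar>baz</bar></html>"
root = ET.fromstring(doc)

# In XML, <foo/> is a complete, empty element, so foo and bar are siblings.
children = [child.tag for child in root]
print(children)  # ['foo', 'bar']

foo = root.find("foo")
bar = root.find("bar")
print(len(foo))  # 0, foo has no child elements
print(bar.text)  # baz
```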
Is this behaviour intentional?
PS: A workaround for -d html is to use a tool like xml c14n ... from xmlstarlet to canonicalise the input before reading.
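The same canonicalisation idea can be sketched without xmlstarlet. The example below assumes Python 3.8 or later, where xml.etree.ElementTree.canonicalize is available; the helper name expand_empty_elements is hypothetical. It only works when the input is already well-formed XML, which real-world html often is not.

```python
import xml.etree.ElementTree as ET


def expand_empty_elements(xml_text: str) -> str:
    """Return the C14N 2.0 form of xml_text.

    Canonical XML writes every element as a start/end tag pair, so
    <foo/> becomes <foo></foo>, which an html parser then closes
    before moving on to the next element.
    """
    return ET.canonicalize(xml_text)


canonical = expand_empty_elements("<html><foo/><bar>baz</bar></html>")
print(canonical)  # <html><foo></foo><bar>baz</bar></html>
```

The canonical output could then be piped to fq -d html in place of the original input.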
Hey! Thanks for the report.
The html parser is based on https://pkg.go.dev/golang.org/x/net/html, which I think is a standards-compliant html5 parser. My guess is that some well-known self-closing elements, like <br/>, are treated differently. For example, this is how Chrome parses <html><br/><bar>baz</bar></html> and <html><foo/><bar>baz</bar></html>:
I had wondered if browsers special-case such tags. I'm not really sure what you can do here except go with whatever the golang implementation decides. It seems to be fairly "unspecified" territory.
Yep, I think which tags to treat specially is part of the html5 spec, and looking at the go parser implementation, https://github.com/golang/net/blob/master/html/parse.go, it looks to be quite hardcoded.
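To illustrate the special-casing: the HTML standard defines a fixed set of void elements (br, link, img and so on) that never have children, and the trailing slash in a start tag such as <foo/> is ignored for any other html element. The sketch below is a toy tree builder on top of Python's html.parser, not the Go parser's actual code. The class name NestingSketch and the helper parse are hypothetical, and it skips everything else a real html5 parser does (implied head and body, foreign content such as SVG, error recovery).

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML Living Standard.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


class NestingSketch(HTMLParser):
    """Builds (tag, children) tuples, ignoring "/>" on non-void tags."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID_ELEMENTS:
            # Only non-void elements stay open and can receive children.
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # "<foo/>" is handled exactly like "<foo>": the slash is ignored.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1][1].append(data)


def parse(markup):
    parser = NestingSketch()
    parser.feed(markup)
    parser.close()
    return parser.root


# foo is not a void element, so it stays open and bar nests inside it.
print(parse("<html><foo/><bar>baz</bar></html>"))
# br is a void element, so bar ends up as its sibling.
print(parse("<html><br/><bar>baz</bar></html>"))
```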
What kind of text do you want to parse? Does using the xml decoder cause other issues?
I was parsing amazon order pages which somehow ended up with a bunch of <i .../> tags, but I think my source material might have been damaged at some point. After starting again with fresh html sources, the problems were gone. The little test cases were just a result of that, and -d html now works fine for my purposes.
This issue can probably be closed, and it can serve as google fodder for people who encounter the same problem.
Great 👍 I've sometimes used some jq to "massage" things before doing queries or exporting to something else, which might be a workaround. Also, fq -o array=true -d html could be interesting; it makes some kinds of queries or manipulations easier.
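As a rough illustration of that kind of massaging, here is a sketch in Python rather than jq. It takes the object-style JSON shown at the top of this issue and moves the children of a wrongly nested tag back up next to it. The helper name hoist_children is hypothetical; it only handles plain JSON objects (not the array=true output), and on a key collision the parent's own key wins.

```python
import json


def hoist_children(node, tag):
    """Move the children of every `tag` object up into its parent."""
    if not isinstance(node, dict):
        return node
    result = {}
    for key, value in node.items():
        value = hoist_children(value, tag)
        if key == tag and isinstance(value, dict):
            # Treat the tag as empty and re-attach its children as siblings.
            result[key] = ""
            for child_key, child_value in value.items():
                result.setdefault(child_key, child_value)
        else:
            result[key] = value
    return result


# Output of fq -d html as reported in this issue.
fq_output = '{"html": {"body": {"foo": {"bar": "baz"}}, "head": ""}}'
fixed = hoist_children(json.loads(fq_output), "foo")
print(json.dumps(fixed))
# {"html": {"body": {"foo": "", "bar": "baz"}, "head": ""}}
```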