wader/fq

Support for non-canonical tags with html

Earnestly opened this issue · 5 comments

  • Arch Linux
  • 0.6.0 (linux amd64)

By "non-canonical tag" I mean a tag that closes itself, e.g. <foobar/>.

There seems to be an issue where fq -d html regards such tags as opening tags and nests everything that follows inside them.

Consider a simple HTML document with a non-canonical element (typically seen as <link .../> tags). Here the foo tag seems to be regarded as the parent of the bar tag:

~ printf '<html><foo/><bar>baz</bar></html>' | fq -d html
{
  "html": {
    "body": {
      "foo": {
        "bar": "baz"
      }
    },
    "head": ""
  }
}

When the input is forced to be decoded as XML, the output is closer to what I expected:

~ printf '<html><foo/><bar>baz</bar></html>' | fq -d xml
{
  "html": {
    "bar": "baz",
    "foo": ""
  }
}

Is this behaviour intentional?

PS: A workaround for -d html is to use a tool like xml c14n ... from xmlstarlet to canonicalise the input before reading.
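For readers without xmlstarlet, here is a minimal Python sketch of the same idea: expand self-closing tags into explicit open/close pairs before handing the text to the HTML decoder. It is not part of fq; the function name `expand_self_closing` and the regex are illustrative assumptions. The void-element set follows the HTML Living Standard as I understand it, and the regex assumes attribute values contain no "<" or ">" and does not special-case comments or script content.

```python
import re

# HTML void elements (assumed from the HTML Living Standard); these never
# take an end tag, so a trailing slash on them is harmless and left alone.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}

# Matches a self-closing tag such as <foo/> or <i class="x" />.
# Simplification: attribute values must not contain "<" or ">".
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9:-]*)((?:\s[^<>]*?)?)\s*/>")


def expand_self_closing(markup: str) -> str:
    """Rewrite <foo/> as <foo></foo> for every non-void element."""

    def repl(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # e.g. <br/> stays untouched
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSING.sub(repl, markup)


if __name__ == "__main__":
    print(expand_self_closing("<html><foo/><bar>baz</bar></html>"))
```

The rewritten text can then be piped into fq -d html, where foo has an explicit end tag and should no longer swallow its siblings.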

wader commented

Hey! thanks for the report

The html parser is based on https://pkg.go.dev/golang.org/x/net/html which I think is a standards-compliant HTML5 parser. My guess is that some well-known self-closing elements, like <br/>, are treated differently. For example, this is how Chrome parses <html><br/><bar>baz</bar></html> and <html><foo/><bar>baz</bar></html>:

[Screenshot 2023-06-30 at 21 41 15: Chrome's parse of the two documents]

Earnestly commented

I had wondered if browsers special-case such tags. I'm not really sure what you can do here except use whatever the Go implementation decides. This seems to be fairly "unspecified" territory.

wader commented

Yep, I think which tags to treat specially is part of the HTML5 spec, and looking at the Go parser implementation https://github.com/golang/net/blob/master/html/parse.go it looks to be quite hardcoded.

What kind of text do you want to parse? Does using the xml decoder cause other issues?
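To illustrate the hardcoding mentioned above, here is a toy Python tree builder. It is not the golang.org/x/net/html code and skips almost everything a real parser does (no implied head/body, no attributes, no error recovery). It only models one rule: in HTML mode the trailing slash is ignored and only a fixed list of void elements is childless, while in XML mode the slash itself closes the element. The names `build_tree` and `html_rules` are made up for this sketch.

```python
import re

# Void elements assumed from the HTML Living Standard.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}

# Either a tag (optional leading "/", name, optional trailing "/") or text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>|([^<]+)")


def build_tree(markup, html_rules=True):
    """Return a list of (name, children) tuples; text nodes are plain str."""
    root = ("#root", [])
    stack = [root]
    for match in TOKEN.finditer(markup):
        closing, name, slash, text = match.groups()
        if text is not None:
            stack[-1][1].append(text)
        elif closing:
            # Pop back to the nearest open element with this name, if any.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i][0] == name:
                    del stack[i:]
                    break
        else:
            node = (name, [])
            stack[-1][1].append(node)
            if html_rules:
                childless = name in VOID_ELEMENTS  # the slash is ignored
            else:
                childless = bool(slash)  # XML: "/>" closes the element
            if not childless:
                stack.append(node)
    return root[1]


if __name__ == "__main__":
    doc = "<html><foo/><bar>baz</bar></html>"
    print(build_tree(doc, html_rules=True))   # foo becomes the parent of bar
    print(build_tree(doc, html_rules=False))  # foo and bar are siblings
```

With the HTML rule, foo stays open and bar nests inside it, which matches the shape of the -d html output in the report; replacing foo with br gives siblings again because br is on the void list.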

Earnestly commented

I was parsing Amazon order pages which somehow ended up with a bunch of <i .../> tags, but I think my source material might have been damaged at some point. After starting again with fresh HTML sources, the problems were gone. The little test cases were just a result of that, and -d html now works fine for my purposes.

This issue can probably be closed; it can serve as Google fodder for people who encounter the same problem.

wader commented

Great 👍 I've sometimes used some jq to "massage" things before doing queries or exporting to something else, which might be a workaround. Also, fq -o array=true -d html could be interesting; it makes some kinds of queries or manipulations easier.