rviscomi/capo.js

Validate `meta[http-equiv]`

rviscomi opened this issue · 0 comments

The WHATWG HTML spec defines a limited set of keywords that are valid attribute values for http-equiv:

  • content-language
  • content-type
  • default-style
  • refresh
  • set-cookie
  • x-ua-compatible
  • content-security-policy

Notable omissions include:

  • origin-trial
  • etag
  • x-* (besides x-ua-compatible)
  • cache-control
  • expires
  • pragma
  • accept-ch
  • content-style-type
  • content-script-type

These are all http-equiv attribute values used by over 100k pages, according to HTTP Archive, in descending order of popularity.
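For illustration, here's a minimal sketch of how a tool could tell the two groups apart. The `classifyHttpEquiv` helper and the `STANDARD_HTTP_EQUIV` set are hypothetical names, not part of capo.js. The set just mirrors the keyword list above, and values are lowercased before comparison, the same way the query below normalizes them.

```javascript
// Keywords listed above as valid http-equiv values per the WHATWG.
const STANDARD_HTTP_EQUIV = new Set([
  'content-language',
  'content-type',
  'default-style',
  'refresh',
  'set-cookie',
  'x-ua-compatible',
  'content-security-policy',
]);

// Hypothetical helper: normalizes an http-equiv value (trim + lowercase)
// and reports whether it appears in the standard keyword list.
function classifyHttpEquiv(value) {
  const keyword = String(value ?? '').trim().toLowerCase();
  return {
    keyword,
    standard: STANDARD_HTTP_EQUIV.has(keyword),
  };
}
```

With this, `classifyHttpEquiv('X-UA-Compatible').standard` is `true`, while `classifyHttpEquiv('origin-trial').standard` is `false`. Whether a non-standard keyword should trigger any kind of warning is a separate decision.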

See the full results and query, if interested
http_equiv pages
x-ua-compatible 5,849,869
content-type 4,064,550
origin-trial 3,741,447
etag 432,755
x-wix-published-version 432,595
x-wix-application-instance-id 432,594
x-wix-meta-site-id 432,593
content-language 430,009
cache-control 351,196
expires 301,664
pragma 296,342
accept-ch 232,735
x-dns-prefetch-control 176,712
content-style-type 172,497
content-script-type 136,871
imagetoolbar 97,656
cleartype 93,802
content-security-policy 82,064
refresh 28,243
keywords 28,027
last-modified 14,478
x-xrds-location 13,495
page-enter 11,945
encoding 10,936
description 10,716
x-rim-auto-match 10,361
msthemecompatible 9,564
reply-to 9,113
language 8,653
content-location 6,896
copyright 6,435
x-frame-options 6,323
window-target 4,930
title 4,601
x-ua-compatiable 4,493
page-exit 4,468
pics-label 3,269
screenorientation 3,105
audience 2,378
author 2,140
access-control-allow-origin 2,072
dc.description 1,836
cache 1,759
robots 1,501
distribution 1,464
vary 1,386
x-webkit-csp 1,376
p3p 1,258
revisit-after 1,226
default-style 1,054

Query:

WITH meta AS (
  SELECT
    page,
    LOWER(JSON_VALUE(meta, '$.http-equiv')) AS http_equiv
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.almanac.meta-nodes.nodes')) AS meta
  WHERE
    date = '2023-06-01' AND
    client = 'mobile' AND
    is_root_page
)


SELECT
  http_equiv,
  COUNT(DISTINCT page) AS pages
FROM
  meta
WHERE
  http_equiv IS NOT NULL
GROUP BY
  http_equiv
ORDER BY
  pages DESC

The biggest one that jumps out to me is origin-trial, which is used on ~3.7M pages. Given that it is explicitly supported and endorsed by Chrome, Edge, and Firefox (Safari doesn't support origin trials), I've left a comment on the WHATWG issue recommending its standardization.

I don't think capo.js should complain about spec validity for these keywords as long as browsers support them. But there are some specific usages worth validating.

http-equiv=content-type

According to the W3C spec, the content attribute of a meta[http-equiv=content-type] tag must be set to a "specially formatted string providing a character encoding name... in exactly the following order":

  1. The literal string "text/html;".
  2. Optionally, one or more space characters.
  3. The literal string "charset=".
  4. One of the following:

The WHATWG further requires that the character encoding name be exactly utf-8 and that:

A document must not contain both a meta element with an http-equiv attribute in the Encoding declaration state and a meta element with the charset attribute present.

capo.js should validate that HTML5 pages set the charset to utf-8 and don't have redundant meta tags. I'm not sure about HTTP header vs. meta tag redundancy, but that's also worth exploring (related: #59).
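A rough sketch of what these checks could look like, under a few assumptions: the function names are hypothetical and not existing capo.js APIs, the meta tags are passed in as plain objects (for example, built from each element's `http-equiv`, `content`, and `charset` attributes), and both the literal strings and the encoding name are matched case-insensitively, which may be more lenient than the spec text intends.

```javascript
// "text/html;" + optional whitespace + "charset=" + an encoding name.
// Assumption: matching is case-insensitive.
const CONTENT_TYPE_PATTERN = /^text\/html;[ \t\n\f\r]*charset=(.+)$/i;

// Checks the content attribute of a meta[http-equiv=content-type] tag.
function checkContentTypeValue(content) {
  const match = CONTENT_TYPE_PATTERN.exec(String(content ?? '').trim());
  if (!match) {
    return { wellFormed: false, encoding: null, isUtf8: false };
  }
  const encoding = match[1].trim().toLowerCase();
  return { wellFormed: true, encoding, isUtf8: encoding === 'utf-8' };
}

// metas: array of plain objects like { httpEquiv, content, charset }.
// Returns a list of human-readable issue strings (empty if none found).
function findEncodingIssues(metas) {
  const issues = [];
  const httpEquivDecls = metas.filter(
    (m) => String(m.httpEquiv ?? '').trim().toLowerCase() === 'content-type'
  );
  const charsetDecls = metas.filter((m) => m.charset != null);

  for (const m of httpEquivDecls) {
    const result = checkContentTypeValue(m.content);
    if (!result.wellFormed) {
      issues.push(`malformed content-type value: "${m.content}"`);
    } else if (!result.isUtf8) {
      issues.push(`non-UTF-8 encoding in content-type: "${result.encoding}"`);
    }
  }

  for (const m of charsetDecls) {
    if (String(m.charset).trim().toLowerCase() !== 'utf-8') {
      issues.push(`non-UTF-8 charset attribute: "${m.charset}"`);
    }
  }

  // The WHATWG rule quoted above: the two declaration forms must not coexist.
  if (httpEquivDecls.length > 0 && charsetDecls.length > 0) {
    issues.push('both meta[http-equiv=content-type] and meta[charset] are present');
  }
  return issues;
}
```

For example, `findEncodingIssues([{ httpEquiv: 'content-type', content: 'text/html; charset=utf-8' }, { charset: 'utf-8' }])` returns a single issue about the redundant declaration. This sketch doesn't look at the HTTP Content-Type header at all.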