Include codepoint sets for each component and for each standard
alwinb opened this issue · 0 comments
Many of the differences between URI, IRI and the WHATWG URL variants (valid, tolerated, sanitised) are about allowing different codepoints to occur verbatim within the various components.
I would like to include a section that contains all the different codepoint sets for each of the relevant components, and then parameterise the grammar.
This would go a long way towards describing the differences between the WHATWG URL standard, RFC3986 and RFC3987, and between the three variants of WHATWG URLs themselves.
The aim is to provide a generalised grammar and to express the various forms of validity across the three standards 'semantically', as constraints on the parse tree and on the codepoints allowed within each component. A few issues will probably remain around drive letters and invalid percent sequences, but other than that I think this can work.
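To make the idea concrete, here is a minimal sketch of codepoint sets keyed by variant and component. It covers only two of the variants (URI and IRI) and three components. The URI sets follow the `pchar`, `query` and `fragment` rules of RFC3986; the IRI sets are deliberately simplified to the Basic Multilingual Plane ranges of `ucschar` and leave out `iprivate`. The names `CODEPOINT_SETS` and `invalid_codepoints` are hypothetical, and `%` is skipped here on the assumption that percent sequences are validated separately.

```python
import string

# RFC3986 building blocks.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")


def is_ucschar_bmp(ch):
    """Simplified RFC3987 ucschar: only the ranges inside the BMP."""
    cp = ord(ch)
    return (0xA0 <= cp <= 0xD7FF
            or 0xF900 <= cp <= 0xFDCF
            or 0xFDF0 <= cp <= 0xFFEF)


def uri_pchar(ch):
    return ch in PCHAR


def iri_pchar(ch):
    return ch in PCHAR or is_ucschar_bmp(ch)


# Hypothetical parameter table: variant -> component -> predicate.
CODEPOINT_SETS = {
    "URI": {
        "path-segment": uri_pchar,
        "query": lambda ch: uri_pchar(ch) or ch in "/?",
        "fragment": lambda ch: uri_pchar(ch) or ch in "/?",
    },
    "IRI": {
        "path-segment": iri_pchar,
        "query": lambda ch: iri_pchar(ch) or ch in "/?",
        "fragment": lambda ch: iri_pchar(ch) or ch in "/?",
    },
}


def invalid_codepoints(variant, component, text):
    """Return the codepoints in text that may not occur verbatim.

    '%' is skipped: percent sequences are assumed to be checked elsewhere.
    """
    allowed = CODEPOINT_SETS[variant][component]
    return [ch for ch in text if ch != "%" and not allowed(ch)]
```

With a table like this, the grammar itself stays the same for every variant and only the lookup changes: `invalid_codepoints("URI", "path-segment", "café")` reports the `é`, while the same call with `"IRI"` reports nothing.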
Steps:
- Add a section on percent-coded strings, either underneath or within the preliminaries.
- Add a section explaining the five URL variants (URI, IRI, and the valid, tolerated and sanitised WHATWG URLs)?
- Include five tables, one per URL variant, each with five codepoint sets (one per relevant component type).
- Parameterise the grammar and try to unify the strict and non-strict grammars.
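As a starting point for the percent-coded strings section, here is a rough sketch of one way to split a component into literal runs, percent-encoded bytes and stray `%` signs. The function name `tokenize_percent_coded` and the token labels are hypothetical. How an invalid percent sequence is then treated would depend on the variant; this sketch only identifies them.

```python
import re

# Order matters: try a full %XX escape first, then a lone '%',
# then a run of any other characters.
_TOKEN = re.compile(r"%[0-9A-Fa-f]{2}|%|[^%]+")


def tokenize_percent_coded(text):
    """Split text into ('literal', str), ('pct', int) and
    ('invalid-pct', '%') tokens, in order of appearance."""
    tokens = []
    for match in _TOKEN.finditer(text):
        piece = match.group()
        if piece.startswith("%") and len(piece) == 3:
            # A well-formed escape: keep the decoded byte value.
            tokens.append(("pct", int(piece[1:], 16)))
        elif piece == "%":
            # A '%' that is not followed by two hex digits.
            tokens.append(("invalid-pct", piece))
        else:
            tokens.append(("literal", piece))
    return tokens
```

For example, `tokenize_percent_coded("a%20b%zz")` yields a literal `a`, the byte `0x20`, a literal `b`, one invalid `%`, and a literal `zz`. The literal runs are what the per-component codepoint sets would then be applied to.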