Opaque Host and domain encoding
Opened this issue · 6 comments
What is the issue with the URL Pattern Standard?
Here's a test case from the WPT test suite, it's expected to succeed (Chrome 123 pass).
const up1 = new URLPattern({ hostname : "xn--caf-dma.com" });
up1.exec({ hostname: "café.com" });
The test expect the hostname "café.com" to be converted to "xn--caf-dma.com" and to match the pattern.
But as per the spec, without a special scheme the host is considered an opaque host and not a domain name, therefore it should be percent-encoded to "caf%C3%A9.com" and not IDNA encoded to "xn--caf-dma.com". The pattern won't match.
Is it an oversight in the test suite/Chrome implementation or should hostname always be considered domain name ?
In some case the specs already allows opaque URL to be treated as special (cf. process pathname for init) and default the URL processing to the most common (most expected??) behaviour.
Should it be the case here too ? Or maybe only in some cases... Like forcing the URL to a special scheme when the host is not an address and domain labels are detected ?
After some more digging, most of the fail tests in this issue #206 are related to this opaque host issue. I think the idea was that these test should (at least) fail when checking forbidden domain code point in the host parser. This checking is only done in the URL parser for non opaque host.
As for the domain encoding issue, all these cases produce URL encoded hostname:
new URLPattern({ hostname : "café.com" }); // hostname pattern string: "caf%C3%A9.com"
new URLPattern({ protocol: "https", hostname : "café.com" }); // hostname pattern string: "caf%C3%A9.com"
new URLPattern("https://café.com" }); // hostname pattern string: "caf%C3%A9.com"
The first one seems logic (except if we consider that every non-adress hostname with unknown protocol should be considered domain) but the other two should be IDNA encoded domain name since they are non opaque/special URLs. It seems everything that goes into the pattern compiler is parsed separately, without context, so the hostname is always considered opaque. We can see the right behaviour when the domain is passed in the base URL :
new URLPattern("", "https://café.com" }); // hostname pattern string: "xn--caf-dma.com"
@jeremyroman Maybe we should pass necessary URL context (the protocol) to the canonicalize a hostname method, for it to be properly interpreted by the URL parser.
And should input without unspecified protocol be force a special protocol ?
I guess ideally you can get both.
In fact, it's exactly what canonicalize a port is already doing : passing the protocol as context to normalize default port values.
And as pointed out above process pathname for init already fallback on "special URL" processing when protocol is unspecified.
So it would only make sense hostname are treated the same ;)
hi @annevk, I finally tried to look into this. It's pretty simple to fix the hostname parsing for regular processing but there's no easy solution to fix it for pattern processing.
The canonicalization functions are passed as callback parameter to the pattern parser. Internally, the pattern parser have no context for what it's currently parsing (scheme, hostname, port...) or what kind of protocol is used for the current URL pattern. So every hostname is always URL encoded.
Without adding URL context to the pattern parser, the easiest (but not so elegant) way would be to URL decode the hostname pattern after pattern parsing and if an empty or special scheme re-encoded it with the host parser. Would that be OK ?
I wonder if @domenic @wanderview or @jeremyroman have some ideas. I don't currently have the bandwidth to look into this.
@jeremyroman @domenic I found a decent way to fix this issue :
- in canonicalize hostname: add a protocolValue param, and set dummyURL.scheme if supplied.
- add a canonicalize domain name method: a simple wrapper around canonicalize hostname with param protocolValue always set to 'https', to force the host parser in non-opaque mode.
- in process hostname for init: add a protocolValue param and a condition if special scheme or empty protocol return the result of calling the canonicalize domain name method with hostnameValue.
- in process URLPatternInit: pass result.protocol as protocolValue to process hostname for init.
- in URLPattern initialize steps: add a condition for hostname if protocol component matches special scheme or protocolComponent pattern is '*', use the CanonicalizeDomainName as encoding callback.
It encodes properly hostname and hostname patterns. If it's OK for you I can make a PR.