whatwg/urlpattern

Constructor string parser has confusing result for paths containing a colon.

Opened this issue · 2 comments

What is the issue with the URL Pattern Standard?

The colon is treated as defining a protocol, even in cases where the URL parser would not (because it looks for a leading alphabetic character).

Consider, e.g., new URLPattern('/scope\\:0/*', 'https://example.com/') (note that the : is already escaped for tokenizing purposes). This fails - in Chromium, the error reads:
TypeError: Failed to construct 'URLPattern': Invalid protocol pattern '/scope'. Invalid protocol '/scope'.

This is because of how the init branch ends up trying to figure out if there's a protocol. It took me some time to figure out how to work around this (use {} around the literal to prevent the parser from thinking too hard about it, basically).

But the URL parser doesn't have this issue. Should we do something similar here? While in general it's hard to tell if the first character is alphabetic because it might begin with a wildcard or regex of some kind, in the majority of cases it's not ambiguous and we could succeed. For example, we might simply say that if it begins with a char token and that token is not alphabetic, we don't try to treat it as a protocol. Anything along those lines would fix the common case of a relative URL beginning with a /, ? or #.

However, genuinely relative ones are still ambiguous and maybe it is more consistent and simpler to just fail. An example might be new URLPattern('http:bar', 'https://example.com/') could hypothetically be trying to refer to a file in the same directory named http:bar.

But the URL parser will already trip you up if you're trying to do that, so I'm not too worried. I think all the cases where my proposed resolution changes things are cases where the pattern would previously have been invalid, so it should be compatible to change.

Thoughts?

For reference, the URL spec's handling of this is at https://url.spec.whatwg.org/#scheme-start-state

rotu commented

There are some other examples that may be relevant from the URL spec:

e.g. new URLPattern("hello:world", "https://example.com/") resolves to {protocol:"https", hostname:"example.com", pathname:"/hello:world"}

but it should have protocol hello and path world.

e.g. new URLPattern("urn:isbn:9780307476463") resolves to {protocol:"urn:isbn", hostname:"", pathname:"9780307476463"}

but it should have protocol urn and path isbn:9780307476463