Avoid passing a 'mode' to specify behaviour of scheme-less URLs as much as possible
alwinb opened this issue · 12 comments
The URL parser in the WHATWG standard specifies the following scheme-dependent behaviour:
- Separators \ are treated as / in URL-strings that start with a special scheme
- Drive-letter-like path components are treated as drive-letters only in file-URLs
- The host is parsed as a domain or an ipv4 address only in special URLs
- The authority must not have credentials or a port, only in file-URLs
- Resolved web-URLs must not have an empty host
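As a rough illustration of why a scheme-less URL is a problem, the five behaviours above can be written as switches derived from the scheme. This is only a sketch: `scheme_rules` and its key names are hypothetical and not part of any existing API. The set of special schemes is the one listed in the WHATWG standard.

```python
# Special schemes as listed in the WHATWG URL standard.
SPECIAL_SCHEMES = {"ftp", "file", "http", "https", "ws", "wss"}


def scheme_rules(scheme):
    """Hypothetical helper: map a scheme to the five switches listed above.

    Returns None for a scheme-less URL, because none of the switches can be
    decided without knowing the scheme (or being told a mode).
    """
    if scheme is None:
        return None
    scheme = scheme.lower()
    special = scheme in SPECIAL_SCHEMES
    is_file = scheme == "file"
    return {
        "backslash_is_separator": special,            # point 1
        "drive_letters": is_file,                     # point 2
        "host_is_domain_or_ipv4": special,            # point 3
        "credentials_and_port_allowed": not is_file,  # point 4
        "host_may_be_empty": is_file or not special,  # point 5
    }
```

The `None` result for a missing scheme is exactly the gap that the 'mode' argument currently fills.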
Currently this is solved by explicitly passing a 'mode' that specifies how to handle scheme-less URLs.
The goal for this issue is to come up with defaults that allow working with scheme-less URLs without having to pass modes around – as much as possible.
See this thread for previous discussion.
Some thoughts:
The WHATWG spec dodges these problems by simply requiring the base URL to be known. It is unfortunate, as it currently makes it really difficult to replicate its behavior without also having that knowledge first.
E.g. you might want to be able to parse a relative URL in an <a href="..."> without having to use the base, then resolve it against the base and have it match the WHATWG spec, but that is tricky.
For the API, that means requiring mindfulness when parsing URLs, which makes it fairly inconvenient to use. It would be neat to be able to just parse URLs and resolve URLs without having to put mind to whether they are meant to be special/web/file/regular URLs and have it just work.
I wish it were possible to always elide the mode (and in fact, not even allow it to be specified) and keep sensible behavior.
With that said, here are some suggestions:
For points (4) and (5), I think it would be sufficient to allow the resolution operation to fail. Then when resolving a base URL against a special URL that would cause an invalid URL to be created (because of their constraints), the operation would simply fail.
For (3), I think it would be sensible to parse the host as opaque in relative URLs, and then decide whether to leave it intact or convert it to a domain/IP address during resolution.
If the domain/IP address parsing fails, so does the resolution.
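A minimal sketch of that idea for (3), assuming the relative URL keeps its host as an opaque string until the scheme of the base is known. `resolve_host`, `ResolutionError` and the `FORBIDDEN` set are hypothetical names. The domain check is heavily simplified: there is no percent-decoding, no IDNA, and none of the hex/octal IPv4 forms that the WHATWG host parser accepts.

```python
import ipaddress

SPECIAL_SCHEMES = {"ftp", "file", "http", "https", "ws", "wss"}
# Simplified stand-in for the standard's forbidden host code points.
FORBIDDEN = set(" #%/:<>?@[\\]^|")


class ResolutionError(ValueError):
    """Raised when resolution would produce an invalid URL."""


def resolve_host(opaque_host, base_scheme):
    """Decide at resolution time how an opaque host is interpreted."""
    if base_scheme not in SPECIAL_SCHEMES:
        return opaque_host  # non-special: leave the host intact
    if opaque_host == "":
        if base_scheme == "file":
            return opaque_host
        raise ResolutionError("web URLs must not have an empty host")
    if opaque_host.startswith("["):
        return opaque_host  # IPv6 literal: not handled in this sketch
    try:
        return str(ipaddress.IPv4Address(opaque_host))
    except ipaddress.AddressValueError:
        pass  # not a dotted-decimal IPv4 address, try it as a domain
    if any(ch in FORBIDDEN for ch in opaque_host):
        raise ResolutionError("invalid domain: " + repr(opaque_host))
    return opaque_host.lower()
```

With this shape, `resolve_host("EXAMPLE.org", "abc")` leaves the host untouched, while the same host resolved against an `https` base is lowercased or rejected.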
Point (2) is really conflicting to me, it really makes me think about #1 again. I think it could be worthwhile to explore that idea a bit more instead of being dismissive about it.
Let’s say, for example, that we do decide to claim that there are only “path parts” instead of the current root/dir/file model. Then we could attribute properties to those parts when they are in URLs depending on their positions.
(NB: When serializing, the slash would be put before each path part.)
We could then say that the first non‐empty path part of the URL (if there is one) is considered its drive letter if it matches the drive letter production (otherwise the URL doesn’t have a drive letter). In addition, we could say that the first path part after the drive letter (if there is a drive letter, and if there is a path part after it) is the URL’s path root. If there is no drive letter, the first path part would be the URL’s path root.
Then resolution could be adapted appropriately to use the URL’s “drive letter” and its “path root”, both of which would be path parts.
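To make the positional rule concrete, here is a small sketch that assumes a path is just a list of path parts. `classify` is a hypothetical name. The drive letter pattern follows the WHATWG definition of a Windows drive letter: an ASCII letter followed by `:` or `|`.

```python
import re

# An ASCII letter followed by ':' or '|'.
DRIVE_LETTER = re.compile(r"[A-Za-z][:|]")


def classify(parts):
    """Return (drive_index, root_index) into parts; either may be None."""
    # The first non-empty path part is the only drive letter candidate.
    first = next((i for i, p in enumerate(parts) if p != ""), None)
    drive = None
    if first is not None and DRIVE_LETTER.fullmatch(parts[first]):
        drive = first
    if drive is not None:
        # The path root is the part right after the drive, if there is one.
        root = drive + 1 if drive + 1 < len(parts) else None
    else:
        # No drive letter: the first path part is the path root.
        root = 0 if parts else None
    return drive, root
```

For example, `classify(["C:", "Users", "x"])` gives `(0, 1)` and `classify(["a", "b"])` gives `(None, 0)`.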
The unfortunate thing about this idea is that then it becomes difficult to represent path‐relative URLs succinctly. You could say “when serializing, the slash would be put after each path part”, but then other things start behaving poorly.
Another issue would be that then resolution ceases being associative. Consider a = "http://example", b = "/C:" and c = "/":
(resolve a (resolve b c)) = (resolve a "/C:/") = "http://example/C:/" (because relative URLs can have drive letters)
(resolve (resolve a b) c) = (resolve "http://example/C:" "/") = "http://example/" (because http: URLs don’t have drive letters)
It could perhaps be good enough to claim that only the second case (i.e. (resolve (resolve a b) c)) should have to match the WHATWG spec behavior, but even then, it’s very unfortunate and confusing behavior.
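The two groupings can be replayed with a toy model. This is a sketch under strong assumptions: it only handles path-absolute references, represents a URL as an origin string plus a path, and lets only scheme-less and file URLs keep a drive letter. `Url`, `resolve` and `serialize` are hypothetical names, not the library's API.

```python
import re
from typing import NamedTuple, Optional


class Url(NamedTuple):
    origin: Optional[str]  # e.g. "http://example"; None for a relative URL
    path: str              # path-absolute, e.g. "/C:/"


# A leading "/X:" or "/X|" that ends the path or is followed by a slash.
DRIVE = re.compile(r"/[A-Za-z][:|](?=/|$)")


def resolve(base, rel):
    """Toy resolution of a path-absolute reference against a base."""
    if rel.origin is not None:
        return rel
    may_have_drive = base.origin is None or base.origin.startswith("file:")
    drive = ""
    if may_have_drive and not DRIVE.match(rel.path):
        found = DRIVE.match(base.path)
        if found:
            drive = found.group(0)  # keep the drive letter of the base
    return Url(base.origin, drive + rel.path)


def serialize(url):
    return (url.origin or "") + url.path


a = Url("http://example", "")
b = Url(None, "/C:")
c = Url(None, "/")

right_first = serialize(resolve(a, resolve(b, c)))  # "http://example/C:/"
left_first = serialize(resolve(resolve(a, b), c))   # "http://example/"
```

The two results differ, which is the loss of associativity described above.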
For (1), I think there is a reasonably neat ad‐hoc approach! I think it could be enough to never acknowledge backslashes as path separators “by default” (except in special URLs) and instead provide a different operation to split paths on backslashes. This operation would then be applied automatically to relative URLs when resolving them against special URLs.
This is starting to look really good to me. I think it may actually work out! And I think 3, 4, and 5 are pretty much solved now!
I also was thinking about not parsing or storing drive-letters (and thus making the parser scheme-independent towards that aspect) and instead use a 'win-file'-goto that detects drives just ahead of time, maintaining the root/dir/file model though. Maybe //c: and esp. file://c: are a bit troublesome, but we'll see.
As for the backslashes, do you have a specific motivation for not considering them as separators? I think I just saw a good reason for that, or a way to see that as a pretty solution, but lost it again.
The last days I was instead leaning towards splitting on them by default. I can easily imagine people trying to copy/paste parts of windows file paths and construct scheme-less URLs that way.
Another idea was to do split on them, but to somehow store the sigil as well, so that there are, in a way, two types of dir components: those that use a postfix / as a sigil and those that use a postfix \.
I'm thinking, for an API to be pedantic about drive letters, so that
- setting a drive on a scheme-less URL will also set its scheme to "file", and
- removing the scheme is not allowed if the (file-) URL has a drive, and
- if a scheme-less URL is parsed as having a drive (by whatever means), then the parser will insert the 'file' scheme.
Since file-URLs use legacy resolution, it is still possible to make host-relative file URLs that way.
And it circumvents the associativity problem. Then the only remaining question is how to parse scheme-less URLs that start with something that looks like a drive by default, and/or if a parser option will be allowed for that.
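A sketch of what such a pedantic API could look like. `UrlRecord` and its method names are hypothetical. Raising an error when a drive is set on a non-file URL is an assumption of this sketch; the rules above do not cover that case.

```python
class UrlRecord:
    """Toy URL record that only ever has a drive when its scheme is 'file'."""

    def __init__(self, scheme=None, drive=None, path=()):
        # A scheme-less URL parsed as having a drive gets the 'file' scheme.
        if drive is not None and scheme is None:
            scheme = "file"
        if drive is not None and scheme != "file":
            raise ValueError("only file URLs may have a drive")  # assumption
        self.scheme = scheme
        self.drive = drive
        self.path = list(path)

    def set_drive(self, drive):
        # Setting a drive on a scheme-less URL also sets its scheme to 'file'.
        if self.scheme is None:
            self.scheme = "file"
        elif self.scheme != "file":
            raise ValueError("only file URLs may have a drive")  # assumption
        self.drive = drive

    def remove_scheme(self):
        # Removing the scheme is not allowed while the URL has a drive.
        if self.drive is not None:
            raise ValueError("cannot remove the scheme of a URL with a drive")
        self.scheme = None
```

Because a drive can never exist on a scheme-less record, the question of whether a relative URL "has" a drive letter does not arise during resolution.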
> As for the backslashes, do you have a specific motivation for not considering them as separators? I think I just saw a good reason for that, or a way to see that as a pretty solution, but lost it again.
My idea was that e.g. "abc/def\ghi" would be parsed as ["abc", "def\ghi"], and there would be an explicit (optional) operation to split paths on backslashes if needed/wanted. But then resolving that URL (before splitting) against, say, https://example.org would implicitly apply that “split” operation to the URL, thus producing ["abc", "def", "ghi"].
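Sketched in code, under the assumption that a path is a plain list of segments. The function names `parse_path`, `split_backslashes` and `path_for_resolution` are hypothetical.

```python
SPECIAL_SCHEMES = {"ftp", "file", "http", "https", "ws", "wss"}


def parse_path(text):
    """Scheme-independent parse: only '/' separates path segments."""
    return text.split("/")


def split_backslashes(segments):
    """The explicit, optional operation: also split segments on backslashes."""
    return [piece for seg in segments for piece in seg.split("\\")]


def path_for_resolution(segments, base_scheme):
    """Apply the split implicitly only when the base has a special scheme."""
    if base_scheme in SPECIAL_SCHEMES:
        return split_backslashes(segments)
    return list(segments)


segments = parse_path("abc/def\\ghi")                   # ["abc", "def\\ghi"]
web = path_for_resolution(segments, "https")            # ["abc", "def", "ghi"]
other = path_for_resolution(segments, "abc")            # ["abc", "def\\ghi"]
```

The parse itself never needs a mode; the scheme only matters once there is a base to resolve against.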
I was doing a DFA based parser some days ago, that kept all the sigils... But I don't think doing this is worth it. You'd have to postpone normalisation then too, so, ehm, //host/😅\😱/.. could not be normalised before resolving, and it might break associativity? I should check that.
I think "scheme-less non-special URLs" are very unlikely to occur with a \ in them in the wild. And they are invalid. It makes everything so much easier to define and parse relative URL-strings as:
- Always hierarchical (no opaque paths)
- Treating \ as /
- Having opaque hosts, if any
- Not having a drive letter (!!)
Still. I think that is a very good deal to make, if you add the following:
- I'd like to rename url1 .goto (url2) to url2 .rebase (url1)
- And I'd allow passing a base, (no idea for the name, say) new URLReference (input, base), where the scheme of the base URL can be used to trigger scheme-dependent parsing; after which input would be rebased on base, e.g. (base goto input). No more modes! 🎉
The reason I made my suggestion regarding backslashes is that I think it would be really confusing if “parsing a relative URL without knowing the scheme, then resolving it against an absolute URL” behaved differently than “parsing a relative URL knowing the scheme it’ll be resolved against, then resolving it”.
With my idea, you can parse "def/ghi\jkl" by itself then resolve it against both abc://example as well as https://example, and in both cases it’ll work like in the WHATWG spec.
It also allows people to have the option to treat it both ways. Because you keep the backslashes intact, and give people the option to treat them specially with a separate operation if they so choose. Whereas if you always treat \ as /, then people can’t choose the other meaning without having to specify a mode again.
All right, but what then to do with path normalisation?
It would treat backslashes as a normal character, so "a/b/c/.." normalizes to "a/b/" and "a/b\c/.." normalizes to "a/", unless you use the splitting operation explicitly. This is similar to “having to pass in a mode”, arguably, but it is made into a separate operation. I don’t think it makes sense to take away the “mode” choice completely here, as much as it would be nice to be able to.
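A minimal sketch of such a normalisation, assuming paths are handled as plain strings. `normalize` and `split_backslashes` are hypothetical names. The sketch ignores percent-encoded dots and simply drops a leading ".." that has nothing to remove, which is an assumption made here for brevity.

```python
def normalize(path):
    """Remove '.' and '..' segments; a backslash is an ordinary character."""
    segments = path.split("/")
    out = []
    for index, seg in enumerate(segments):
        last = index == len(segments) - 1
        if seg == "..":
            if out:
                out.pop()
            if last:
                out.append("")  # keep the trailing slash
        elif seg == ".":
            if last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def split_backslashes(path):
    """The separate, explicit operation: treat backslashes as slashes."""
    return path.replace("\\", "/")


plain = normalize("a/b/c/..")                            # "a/b/"
kept = normalize("a/b\\c/..")                            # "a/"
split_first = normalize(split_backslashes("a/b\\c/.."))  # "a/b/"
```

Calling `split_backslashes` first is where the choice formerly expressed by a mode ends up.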
And note that although, yes, my idea does break associativity and also makes relative URLs behave differently when used by themselves vs. resolved against absolute URLs, so does your idea of “always treat \ as / in relative URLs”, so I don’t think your idea is clearly better.
I was thinking, one could maybe add a condition, that simply prevents the .. segments from annihilating with preceding segments that have a \ in them. Until there’s a scheme. Maybe something like that would avoid such problems?
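One possible reading of that condition, as a sketch only. `normalize_guarded` is a hypothetical name. The rule that a kept ".." also blocks later ".." segments is an assumption added here so that nothing is removed prematurely.

```python
def normalize_guarded(path):
    """Like ordinary dot-segment removal, except that '..' is kept in place
    when the preceding segment contains a backslash (or is itself a kept
    '..'), so the decision can be postponed until a scheme is known."""
    segments = path.split("/")
    out = []
    for index, seg in enumerate(segments):
        last = index == len(segments) - 1
        removable = bool(out) and out[-1] != ".." and "\\" not in out[-1]
        if seg == ".." and removable:
            out.pop()
            if last:
                out.append("")  # keep the trailing slash
        else:
            out.append(seg)
    return "/".join(out)


unguarded_case = normalize_guarded("a/b/c/..")   # "a/b/"
guarded_case = normalize_guarded("a/b\\c/..")    # stays "a/b\\c/.."
```

Once the scheme is known, the remaining ".." can be resolved either against "b\c" as a whole (non-special) or against "c" alone after splitting (special).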
I should maybe note that I’m quite keen on having normalisation be a congruence, ehm, as in, compatible with the goto operator. Do you know what I mean?
Algebraic normal-form semantics, to throw some fancy words around.
I think something neat just occurred to me.
Would it be sensible to allow backslashes to be a part of paths in special URLs, and only split on them as part of the force operation (and thus also as part of forced normalization)? (Only for special URLs, of course.)
So you could say that “normalization” is a normal form operation (algebraically, in relation to resolution), but that force isn’t.
This is kinda neat because it delegates all “warts” to force and forced resolution, including backslash handling.
> So you could say that “normalization” is a normal form operation (algebraically, in relation to resolution), but that force isn’t.
This is certainly what I’ve been after. I didn’t really make it explicit, I thought it was kind of nice to just hint at that.
More generally, I think a good strategy is to decompose problems along the ‘boundaries’ of well behaved formalisms and have the exceptions be in the glue code.
In this case trying to come up with something clean for the spec came out as URL algebra, (though there are some annoying details, if you’d really set out to phrase it as a mathematical theory proper).
As for the backslashes (and I’m not yet decided, but let’s disregard that for now) yes, I’d want to stick to one set of normalisation rules. Splitting on them in the force op and normalising after that to get the whatwg behaviour would be a good way to go.
I wanted to ask you how you'd like to make working together on this a bit more formal. You have been helping a lot with suggestions and you've been especially supportive. I really appreciate it. If you like, you can contact me by email, alwinb at gmail.com.