URLPattern usage
Colleagues and I were curious if Chromium had any plans it could share around whatwg/urlpattern#191, as running JS regular expressions in the networking layer doesn't seem like it will fly.
Is the idea to have some kind of safe subset?
For compression dictionaries, the plan right now is to not allow regexp at all and bring URLPattern's use closer to the wildcard syntax we already had in place (which has already drawn pushback on the IETF side).
For what it's worth, servers largely don't need to parse the URLPattern and use it for matching. They provide the match pattern (likely an opaque application-controlled string) and then the client does the matching. The server largely only cares if the requested resource is available with the requested dictionary after the client has already decided which dictionary it should use.
I think that really depends on the client implementation, no? One might want to implement the matching in the networking layer and as such similar concerns apply.
And that also seems like a client-server-centric view as server-server will also want to use this technology.
Yes, sorry - in that case one of the servers is the client to another server. Client more in the sense of HTTP, "thing that sends a request", not necessarily end-user device.
Middle-boxes like CDNs or other reverse proxies are largely still behaving like "servers" for compression dictionaries: they don't need to store and match dictionaries to a given path and would still rely on the end client to do the matching.
It's probably also worth noting that `Available-Dictionary:` and dictionary-compressed responses don't necessarily need to rely on `Use-As-Dictionary:` and path matching. In a closed server-to-server case or with a well-known (native) client, the dictionaries can be pre-shared and advertised directly using the same mechanism, without relying on the URLPattern matching that comes with `Use-As-Dictionary`.
I think we might still be talking past each other, but a variant of URLPattern that does not have regular expression support makes sense to me.
It's always been the goal of URLPattern to support some call sites rejecting URLPatterns that contain regexp groups.
The original plan was just to have the calling specs do something like:
- Let urlPattern be the result of [parsing] inputString.
- If urlPattern [has regexp groups], then (reject it).
where [parsing] and [has regexp groups] are `<dfn>`s in the URL Pattern spec. This was part of what I was pointing at in whatwg/urlpattern#182. In other words, defining [has regexp groups] is a prerequisite for properly specifying service worker static routing and compression dictionary transport, so I suspect that will be worked on soon.
@yoshisatoyanagisawa expressed concern about giving this "has regexp groups" predicate only to spec-writers, and not to web developers. That is, he thought it would be too difficult for web developers to debug why their URL patterns were getting rejected by (for example) service worker static routing, since `new URLPattern(...)` would not reject the same strings.
His proposal is whatwg/urlpattern#191. With that, web developers can easily figure out whether an input string `x` will be rejected by such APIs by testing `(new URLPattern(x)).hasRegExpGroups` (e.g., in their DevTools console, or in Node.js).
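A minimal sketch of that debugging flow, assuming a runtime that implements the proposed `hasRegExpGroups` getter:

```js
// A named group like :id is fine; a regexp group like (\d+) is what the
// restricted call sites would reject.
const plain = new URLPattern("/products/:id", "https://example.com");
const withRegexp = new URLPattern("/products/(\\d+)", "https://example.com");

console.log(plain.hasRegExpGroups);      // false — acceptable to regexp-free call sites
console.log(withRegexp.hasRegExpGroups); // true — would be rejected by such APIs
```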
Hi @pmeenan,
@jeremyroman has seen hints somewhere that there's a desire to split the URL pattern usage into some sort of `match-path` and `match-search`, where each only accepts a subset of the URL pattern string syntax. As part of the work in whatwg/urlpattern#199 on trying to ensure the specification ecosystem uses the same syntax everywhere, we'd like to discuss why you're going this route, and if there's any way we could avoid it. Ideally, all specifications that accept URL patterns as strings do so in the same way.
(That said, I can't find any matches for those keywords in current public documentation, so maybe we're off base?)
> @jeremyroman has seen hints somewhere that there's a desire to split the URL pattern usage into some sort of `match-path` and `match-search`, where each only accepts a subset of the URL pattern string syntax. […]
It's part of the in-flight changes to the draft with the move to URLPattern from the wildcard string.
Compression dictionary transport's use of a match pattern MUST be same-origin as the request, and there was a discussion/issue about making it impossible to even specify the origin part, so that only the path and search parts are specifiable (since hashes don't exist at the HTTP level).
In the vast majority of cases the search part will likely use the default wildcard, so it's only the path portion that most users will be using.
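As a hedged sketch of what that looks like from the client side (the header field name and exact match syntax were still being discussed at this point, so the values below are hypothetical): a match value that names only the path can be resolved against the URL the dictionary was served from, so the origin is fixed by construction.

```js
// Hypothetical match value carrying only a path pattern; the origin parts are
// inherited from the dictionary request URL used as the base.
const matchValue = "/static/app-*.js";
const dictionaryUrl = "https://example.com/static/app-v1.js";

const pattern = new URLPattern(matchValue, dictionaryUrl);
console.log(pattern.test("https://example.com/static/app-v2.js"));       // true
console.log(pattern.test("https://other.example.com/static/app-v2.js")); // false — origin pinned by the base URL
```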
Is there a problem with specifying a more-rigid URLPatternInit instead of adding the complexity of the USVString case? Otherwise we'd also need a bunch of language about the subset of pattern strings that are allowed.
> Is there a problem with specifying a more-rigid URLPatternInit instead of adding the complexity of the USVString case?
Yes, this fragments the API used across the web platform. In particular, you're not using URLPatternInit, or the actually-recommended URLPatternCompatible; you're using a subset of URLPatternInit's fields.
> Otherwise we'd also need a bunch of language about the subset of pattern strings that are allowed.
This isn't true. You can just ensure that the matches always return false.
Concretely, if someone supplies `https://other.example.com/*` as a pattern, and you try to match it against `https://example.com/foo`, this will automatically return false. No extra specification is needed.
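For example (a quick check in any URLPattern-capable runtime):

```js
const pattern = new URLPattern("https://other.example.com/*");

console.log(pattern.test("https://example.com/foo"));       // false — different origin, never matches
console.log(pattern.test("https://other.example.com/foo")); // true
```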
> recommended URLPatternCompatible
Do you have a link to docs for URLPatternCompatible? I can't find it in the spec or explainer for URLPattern and Google search isn't finding it.
> You can just ensure that the matches always return false
We need to do the same-origin check at the time the dictionary is validated and stored, to make sure that the pattern refers to the same origin as `baseURL`, which will be set to the URL of the request that the dictionary was served from.
i.e. if the dictionary resource is `https://example.com/12345/main.js`, we need to make sure the match pattern is restricted to `https://example.com/`.
It's certainly doable, but practically that means constructing the URL pattern, constructing a URL from the URLPattern components, and then comparing the origin of the reconstructed URL to the origin of the dictionary request to make sure they are the same.
That's compared to guaranteeing the origin matches by not allowing any of the origin components to be specified and writing spec language that says that, even though the string URLPattern constructor is being used, the string SHOULD not include any components before the path and MUST be same-origin.
I don't mind making the change, but it would help if the requirement to always use strings and not allow dictionaries or other component-based construction for URLPattern was solidified (as best as I can tell, it is being discussed but hasn't been merged).
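A rough sketch of what that storage-time check might look like, assuming the match value is a pattern string resolved against the dictionary request URL (illustrative, not spec text):

```js
// Returns true only if the compiled pattern can match URLs on the same origin
// as the dictionary request; anything else is rejected before storage.
function patternIsSameOrigin(matchString, dictionaryRequestUrl) {
  const requestUrl = new URL(dictionaryRequestUrl);
  const pattern = new URLPattern(matchString, requestUrl.href);

  return (
    pattern.protocol === requestUrl.protocol.replace(/:$/, "") &&
    pattern.hostname === requestUrl.hostname &&
    pattern.port === requestUrl.port
  );
}

// patternIsSameOrigin("/12345/*", "https://example.com/12345/main.js")                    // true
// patternIsSameOrigin("https://other.example.com/*", "https://example.com/12345/main.js") // false
```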
> Do you have a link to docs for URLPatternCompatible? I can't find it in the spec or explainer for URLPattern and Google search isn't finding it.
It's in an in-flight PR at the moment; preview available here: https://whatpr.org/urlpattern/199.html#typedefdef-urlpatterncompatible
> We need to do the same-origin check at the time the dictionary is validated and stored to make sure that the pattern refers to the same origin as `baseURL` which will be set to the URL of the request that the dictionary was served from. […] It's certainly doable, but practically that means constructing the URL pattern, constructing a URL from the URLPattern components and then comparing the origin of the reconstructed URL to the origin of the dictionary request to make sure they are the same.
Naively I wouldn't have expected additional logic to be required, except possibly to assist developers trying impossible things. I'm imagining that:
- each origin has a list of `(pattern, dictionary, ttl, type, ...)` tuples
- to see if a dictionary exists, you find the list for that origin and compare the request URL with each of the patterns in turn
In that case, even if `https://example.com/` registers a dictionary for the pattern `https://not-example.com/*`, no URL for `https://not-example.com` will ever be looked up in the table for `https://example.com`, so it just kinda sits there, never matching ever. An implementation could optimize by not storing it and/or warn the developer that they have done something silly, but it wouldn't be required for correctness to do so (and so spec language wouldn't be required).
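A minimal sketch of that lookup shape (the entry field names here are illustrative, not from any draft):

```js
// origin -> list of { pattern, dictionary, ttl, type } entries
const dictionariesByOrigin = new Map();

function findDictionary(requestUrl) {
  const url = new URL(requestUrl);
  const entries = dictionariesByOrigin.get(url.origin) ?? [];
  // An entry whose pattern points at some other origin can sit in this list,
  // but it will simply never match a URL on this origin.
  return entries.find((entry) => entry.pattern.test(url.href)) ?? null;
}
```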
The upside of using the general syntax is greater consistency with other uses of URL patterns (and it is more likely to set a precedent for other uses in HTTP header fields that may have slightly different constraints). Though at least here I think the transformation is not too bad (`match="aaa", match-search="bbb"` -> `{pathname: "aaa", search: "bbb", baseURL: "[response URL]"}`, or possibly `{protocol: "*", hostname: "*", port: "*", username: "*", password: "*", pathname: "aaa", search: "bbb", hash: "*", baseURL: "[response URL]"}`).
The downside of the general syntax is that it might create a slight expectation of being able to match things that aren't actually possible in this context, like cross-origin URLs and URLs with a fragment. I think this would be stronger in cases where this would be processed by a server that might not know the protocol and/or port the client is seeing (due to reverse proxy) and so constructing patterns with wildcards in those fields might be effectively necessary.
Aside: that draft is currently using the request URL (unclear to me if it's the initial or final one); hopefully that corresponds to the URL after redirects (and possibly after service worker magic, if this processing might occur after that -- not sure where this is layered).
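A sketch of the first of those transformations, assuming hypothetical header fields named `match` and `match-search`:

```js
// Build a URLPattern from hypothetical match / match-search header values,
// letting the unspecified origin components be inherited from the response URL.
function patternFromHeaderFields(match, matchSearch, responseUrl) {
  return new URLPattern({
    pathname: match,
    search: matchSearch ?? "*",
    baseURL: responseUrl,
  });
}

const pattern = patternFromHeaderFields("/static/*", undefined, "https://example.com/static/dict.bin");
console.log(pattern.hostname); // "example.com" — inherited from baseURL
```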
> It's in an in-flight PR at the moment; preview available here: https://whatpr.org/urlpattern/199.html#typedefdef-urlpatterncompatible
Thanks. I'll have to wait for the PR to get resolved before it is something I could reference, but as it looks right now, URLPatternCompatible is specific to the JavaScript API. Probably more interesting is the JSON interface below it, which can take a sparse map of URLPatternInit components; that is not terribly different (just how the map is structured).
> - each origin has a list of `(pattern, dictionary, ttl, type, ...)` tuples
Not quite. There is no mandate on where or how the matches should be stored but, at least with Chrome, they are partitioned by network isolation key (document origin, not request origin). It's certainly possible to also store the original request origin for each dictionary and that can be an implementation detail but it's a lot less efficient than not allowing the invalid dictionary to be stored in the first place.
The same-origin check still needs to be specified at either match time or storage time even if it is just to mention that the dictionary is only valid if the URLPattern is same-origin with the request.
> Aside: that draft is currently using the request URL (unclear to me if it's the initial or final one); hopefully that corresponds to the URL after redirects (and possibly after service worker magic, if this processing might occur after that -- not sure where this is layered).
As far as HTTP is concerned, redirects are their own requests that happen to have a 301/302 response code so this is guaranteed to be the final request that actually serves the dictionary. It's a browser thing that redirects are transparently handled under the covers.
https://whatpr.org/urlpattern/199.html#urlpattern-build-a-urlpattern-from-a-webidl-value looks fine if you pass a string, but it does seem like a dedicated string entry point (that calls this under the covers and maybe also creates a pretend realm or ensures that isn't needed somehow) would help clarify things for non-IDL consumers. I guess that's what the XXX below is saying as well.
> As far as HTTP is concerned, redirects are their own requests that happen to have a 301/302 response code so this is guaranteed to be the final request that actually serves the dictionary.
How is that guaranteed? What if somebody provided a dictionary body with their 3xx response?
> How is that guaranteed? What if somebody provided a dictionary body with their 3xx response?
It doesn't really need to be guaranteed since the 3xx response is for a request on the origin where the 3xx response came from, not for the target location. I can't imagine most clients would store a dictionary from a 3xx response but there's also no issue if they do.
> It doesn't really need to be guaranteed since the 3xx response is for a request on the origin where the 3xx response came from, not for the target location. I can't imagine most clients would store a dictionary from a 3xx response but there's also no issue if they do.
Sorry, I should clarify. As far as the IETF draft is concerned it shouldn't matter. When the fetch part of the spec is drafted, it should be made clear that only a `Use-As-Dictionary` header on a final response should be honored.
Closing this out, as the latest IETF drafts have integrated with the URL Pattern spec with regex disallowed, using the intended integration points (not relying on the JS API).