Globstar `**` causes exponential runtime behavior
Opened this issue · 2 comments
What is the issue with the URL Pattern Standard?
I've recently come across this bug report in the popular urlpattern-polyfill library. Under "§ 2.2. Converting part lists to regular expressions", a pattern such as
new URLPattern({ hostname: '**.google.com' })
is converted to the regex
/^((?:.*)*)\.google\.com$/u
. This regular expression runs in exponential time under V8, and even breaks on sufficiently long input in SpiderMonkey with the error "InternalError: too much recursion". Of the major JS engines, only WebKit is able to execute it in linear time. The globstar pattern is very common in many globbing languages/libraries, for example minimatch. Software wanting to adopt the standardized URLPattern in exchange for the underspecified space of globbing dialects, will eventually have to deal with end user's code subtly breaking due to this (example).
To avoid this, and under the assumption that there is not much value in emitting ((?:.*)*)
, would it maybe be possible to interpret globstars as unnamed groups, i.e. (.*)
? I suspect that there was likely a reason not to support this pattern, but I haven't been able to find any discussion around the topic of globstars in relation to this spec, so I wanted to at least raise it publicly.
Interestingly, while the exponential runtime behavior happens with the polyfill, the native V8 URLPattern implementation does not appear to suffer the same issue.
Interesting. The way a series of modifiers works is, in a different way, also interesting.
This tokenizes as asterisk asterisk char(.) char(g) char(o) ...
, which gets turned into:
- a part whose type is
full-wildcard
and which has a modifier ofzero-or-more
- a fixed part
It seems not too hard to just drop the modifier when we have a full wildcard. Would that be enough? I think you could still make it exponential in such regex engines by writing ([^])*
or similar, but it'd at least be less of a footgun.
Honestly I'm not sure of the utility of applying any of the modifiers to full-wildcard
parts. Is that syntax (writing **
, *?
or *+
) useful in any way, or is it always exactly equivalent to *
, including the result of trying to match it. I think it's equivalent so we could just drop the modifier, or maybe even reject such patterns since the author might be trying to do something else (would need investigation if that's web-compatible to do even if we wanted to).
Dropping the modifier under those conditions is enough to generate a benign /^(.*)\.google\.com$/u
RegEx. I've experimentally added the logic before step 7.1; not sure if it should/can apply if there's a part prefix or suffix, as I haven't fully understood those scenarios.
Regarding *?
and *+
, I also think they would technically be equivalent to **
, but rejecting those patterns might be more helpful, because the developer most likely meant to refer to a literal asterisk in this case, i.e. they should use ([*]?)
/ ([*]+)
. But I don't have a strong opinion on that, and, of course, web-compatibility would have to be checked.
I've also ran both through the unmodified urlpattern-polyfill:
*?
becomes(.*)?
, as expected, afaict, under step 7.1. (no exponential runtime)*+
becomes((?:.*)+)
, as expected, afaict, under step 7.2. (exponential runtime)
So interestingly, there already is some inconsistency, where *?
behaves a bit more sensibly.