whatwg/urlpattern

More consistent and robust segment wildcard generation

rubycon opened this issue · 0 comments

What is the issue with the URL Pattern Standard?

The generate a segment wildcard regexp steps produce a regex that is:

  • not internally consistent with the full wildcard in how it handles newlines;
  • reliant on an obscure regex feature: the inverted empty character class, or empty character class complement;
  • tricky for some current implementations of the RegExp v flag.

The proposed change should make the segment wildcard more consistent with the full wildcard and, in passing, be more forgiving of buggy RegExp implementations.

When processing the regex pattern for most parts of a URL (except for host and path), the generate a segment wildcard regexp algorithm is called with the default options, for which the delimiter code point is the empty string.

The generated regex string is then [^]+?, an inverted empty character class with lazy matching. It matches every character, including newlines, which differs slightly from the full wildcard (which matches every character except newlines).
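The steps described above can be sketched as follows. This is a minimal illustration, not the spec text: the function names and the exact set of escaped characters are assumptions modelled on the spec's algorithm names, and the `u` flag is used here only to keep the newline comparison independent of the `v` flag behaviour discussed below.

```javascript
// Sketch only: the escaped character set is an assumption, meant to mirror
// the spec's "escape a regexp string" algorithm.
const REGEXP_SPECIAL = /[.+*?^${}()[\]|\/\\]/g;

function escapeRegexpString(input) {
  // Prefix every special character with a backslash.
  return input.replace(REGEXP_SPECIAL, "\\$&");
}

// Current behaviour: "[^" + escaped delimiter + "]+?".
function generateSegmentWildcardRegexp(delimiterCodePoint = "") {
  let result = "[^";
  result += escapeRegexpString(delimiterCodePoint);
  result += "]+?";
  return result;
}

// The full wildcard regexp value.
const fullWildcardRegexp = ".*";

console.log(generateSegmentWildcardRegexp());    // [^]+?
console.log(generateSegmentWildcardRegexp("/")); // [^\/]+?

// Newline handling differs between the two wildcards.
const segment = new RegExp(`^(?:${generateSegmentWildcardRegexp()})$`, "u");
const full = new RegExp(`^(?:${fullWildcardRegexp})$`, "u");
console.log(segment.test("foo\nbar")); // true
console.log(full.test("foo\nbar"));    // false
```

With an empty delimiter the character class is empty, which is the case the rest of this issue is about.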

But combined with the v flag required by the spec, it works differently: the regex tries to match a complement class instead of inverting the match. This should be equivalent when dealing with an empty class, but some current implementations don't seem to handle it well. Testing the generated regex /^([^]+?)$/v.test("foobar") with current RegExp implementations:

  • Chrome 122 (v8 12.2.219) => match
  • Deno 1.39.4 (v8 12.0.267.8) => match
  • Node 20.11 (v8 11.3.244.8) => no match
  • Firefox (122) => no match
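To reproduce the comparison above in another engine, a small probe along these lines could be used. The `probe` helper is hypothetical (it is not part of any API). It builds the regex at run time, so an engine that rejects the v flag reports "unsupported" instead of throwing. No expected result is given for the v flag, since that is exactly what varies between implementations.

```javascript
// Hypothetical helper: compile `source` with `flags` and test it on `input`.
function probe(source, flags, input) {
  let re;
  try {
    re = new RegExp(source, flags);
  } catch (err) {
    // The engine rejected the flag (or the pattern) at compile time.
    return "unsupported";
  }
  return re.test(input) ? "match" : "no match";
}

// Compare the u and v flags on the generated segment wildcard.
for (const flags of ["u", "v"]) {
  console.log(flags, probe("^([^]+?)$", flags, "foobar"));
}
```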

This simple change would avoid dealing with the empty character class regex in the first place and avoid the newline inconsistency.

In generate a segment wildcard regexp, replace

  1. Append "]+?" to the end of result.

with

  1. Append "\n\r]+?" to the end of result.

It ensures the character class is never empty.
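A minimal sketch of the proposed variant, assuming the same hypothetical helper names as the spec's algorithm names and an assumed set of escaped characters. It uses the v flag when the engine accepts it and falls back to u otherwise.

```javascript
// Sketch only: the escaped character set is an assumption.
function escapeRegexpString(input) {
  return input.replace(/[.+*?^${}()[\]|\/\\]/g, "\\$&");
}

// Proposed behaviour: always exclude \n and \r, so the class is never empty.
function generateSegmentWildcardRegexpProposed(delimiterCodePoint = "") {
  return "[^" + escapeRegexpString(delimiterCodePoint) + "\\n\\r]+?";
}

// Prefer the v flag; fall back to u if the engine does not support it.
const flags = (() => {
  try {
    new RegExp("", "v");
    return "v";
  } catch {
    return "u";
  }
})();

const segment = new RegExp(`^(${generateSegmentWildcardRegexpProposed()})$`, flags);
const full = new RegExp("^(.*)$", flags);

console.log(generateSegmentWildcardRegexpProposed()); // [^\n\r]+?
console.log(segment.test("foobar"));   // true
console.log(segment.test("foo\nbar")); // false
console.log(full.test("foo\nbar"));    // false, same as the segment wildcard
```

Note that `.` without the s flag also excludes U+2028 and U+2029, so the two wildcards would still differ on those two code points.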