whatwg/urlpattern

Handling of special characters in hostname

rubycon opened this issue · 0 comments

What is the issue with the URL Pattern Standard?

According to the Web Platform Tests, these hostnames should throw a TypeError:

  • bad/hostname
  • bad#hostname
  • bad%hostname
  • bad\:hostname
  • bad\nhostname
  • bad\rhostname
  • bad\thostname

However, hostname validation relies almost entirely on the URL spec's basic URL parser, and according to that spec these cases don't throw a TypeError.

After they're passed to the constructor, they go through the initialize steps and are passed to process a URLPatternInit, but they are not validated there because they're patterns. They are then passed to compile a component with the canonicalize a hostname callback, and finally to the basic URL parser with an empty URL record and a state override of hostname state.

  • bad\nhostname, bad\rhostname, bad\thostname: The basic URL parser strips all ASCII tabs and newlines before processing the input:

    2. If input contains any ASCII tab or newline, invalid-URL-unit validation error.

    3. Remove all ASCII tab or newline from input.

    So these three strings are treated as badhostname and no error is thrown. However, a non-fatal invalid-URL-unit validation error is reported. This behaviour is consistent with the external URL API (e.g. new URL("http://bad\nhostname") is OK).
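
The stripping behaviour can be checked against the public URL API. This is a minimal sketch, assuming a runtime with a spec-conformant global `URL` (a recent browser or Node.js); the names `inputs` and `hostnames` are only illustrative:

```javascript
// Hostnames containing an ASCII tab or newline, as listed in the test.
const inputs = ["bad\nhostname", "bad\rhostname", "bad\thostname"];

// The URL parser removes ASCII tabs and newlines before parsing, so none
// of these throw; each hostname is read back after the removal.
const hostnames = inputs.map((h) => new URL(`http://${h}`).hostname);

console.log(hostnames);
```

All three entries should come back as badhostname, matching the spec steps quoted above.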

  • bad/hostname and bad#hostname: The URL parser stops processing the input at the delimiter and keeps only bad as the hostname, which validates successfully.

    3. Otherwise, if one of the following is true:

    • c is the EOF code point, U+002F (/), U+003F (?), or U+0023 (#)
    • url is special and c is U+005C (\)

    bad?hostname, by contrast, fails in the pattern parser, which expects the ? modifier to be the last character.
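
The same truncation is observable through the `hostname` setter of the URL API, which also runs the basic URL parser with a hostname state override. The sketch below is only an approximation of the URLPattern path: it starts from a special (`http`) URL rather than an empty URL record, and `setHostname` is a hypothetical helper introduced for illustration:

```javascript
// Hypothetical helper: assign `value` through the hostname setter (basic URL
// parser with a hostname state override) and return the resulting hostname.
function setHostname(value) {
  const url = new URL("http://example.com/");
  url.hostname = value;
  return url.hostname;
}

// Everything from the "/" or "#" delimiter onwards is ignored by the parser.
const afterSlash = setHostname("bad/hostname");
const afterHash = setHostname("bad#hostname");

console.log(afterSlash, afterHash);
```

In a spec-conformant implementation both values should be bad, with no exception thrown.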

  • bad\:hostname: The : character is escaped in the pattern parser, so bad:hostname is passed to the URL parser. When the parser encounters the : character with a state override of hostname state, it returns without setting any hostname.

    2. Otherwise, if c is U+003A (:) and insideBrackets is false, then:

    2. If state override is given and state override is hostname state, then return.

    After the parser returns, the hostname is still null and the code later fails on an assertion when running generate a regular expression and name list.
    This case looks more like a URL spec issue: it is not consistent with the handling of the /, ? and # delimiters.
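
The early return on : can also be seen through the `hostname` setter of the URL API. This is a rough sketch, assuming a spec-conformant `URL` implementation; unlike the URLPattern case, the URL here already has a host, so the early return leaves the old value in place instead of null:

```javascript
const url = new URL("http://example.com/");

// The parser returns as soon as it reaches ":" under a hostname state
// override, so the assignment has no effect on the existing host.
url.hostname = "bad:hostname";
const hostnameAfterColon = url.hostname;

console.log(hostnameAfterColon);
```

If the implementation follows the quoted step, the hostname stays example.com, whereas a / or # in the same position would have set it to bad.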

  • bad%hostname: The hostname is fully parsed by the URL parser and passed to the host parser as an opaque host. The % is allowed in an opaque host, but only as part of a percent-encoded byte, so a non-fatal invalid-URL-unit validation error occurs.

    3. If input contains a U+0025 (%) and the two code points following it are not ASCII hex digits, invalid-URL-unit validation error.
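
The difference between opaque-host and domain parsing for % can be illustrated with the URL API. This is a minimal sketch, assuming a spec-conformant global `URL`; the `foo` scheme is an arbitrary non-special scheme picked for illustration, standing in for the empty URL record used by URLPattern:

```javascript
// With a non-special scheme the host goes through the opaque-host parser,
// which only reports a validation error for a stray "%".
const opaque = new URL("foo://bad%hostname");
const opaqueHostname = opaque.hostname;

// With a special scheme the host is parsed as a domain, where "%" is a
// forbidden domain code point, so construction fails.
let specialThrows = false;
try {
  new URL("http://bad%hostname");
} catch (err) {
  specialThrows = err instanceof TypeError;
}

console.log(opaqueHostname, specialThrows);
```

Only the special-scheme variant throws, which is consistent with the opaque-host path described above accepting bad%hostname.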