sindresorhus/normalize-url

Semicolons are erroneously encoded in query params


Hey,

I've had a user report the following normalization:

normalize('https://my.otrs.dom/index.pl?Action=AgentTicketZoom;TicketID=707128') == 'https://my.otrs.dom/index.pl?Action=AgentTicketZoom%3BTicketID%3D707128'

...which according to the user didn't preserve the semantics of the URL.

Checking RFC 3986, it appears that ; and = are part of the sub-delims non-terminal, which belongs to the set of reserved characters that should not be percent-encoded.

Am I missing something?

It's just URL-encoded. That doesn't change the semantics of the URL:

const a = 'https://my.otrs.dom/index.pl?Action=AgentTicketZoom;TicketID=707128';
const b = 'https://my.otrs.dom/index.pl?Action=AgentTicketZoom%3BTicketID%3D707128';

new URL(a).searchParams.get('Action') === new URL(b).searchParams.get('Action')
//=> true

Mh. I assume this is because the URL implementation simply treats ; as data rather than as a delimiter. That's fine in itself, but the encoded result is not canonical.
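That assumption can be checked with a minimal sketch, runnable in Node.js or a browser console. It uses only the standard URL and URLSearchParams APIs; the variable names are illustrative.

```javascript
// Sketch: how the WHATWG URL parser splits the reported query string.
const reported = new URL('https://my.otrs.dom/index.pl?Action=AgentTicketZoom;TicketID=707128');

// Only "&" separates pairs, so ";" stays inside the value of "Action".
const pairs = [...reported.searchParams];
console.log(pairs);
//=> [ [ 'Action', 'AgentTicketZoom;TicketID=707128' ] ]

// From the parser's point of view there is no separate "TicketID" parameter.
const ticketId = reported.searchParams.get('TicketID');
console.log(ticketId);
//=> null
```

An application that splits the query on ; itself, as the reporting user's server presumably does, would see two parameters in the original URL but only one in the encoded form.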

RFC 3986, section 2.2, says:

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent. Percent-
encoding a reserved character, or decoding a percent-encoded octet
that corresponds to a reserved character, will change how the URI is
interpreted by most applications.
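For illustration, the percent-encoding normalization that RFC 3986 does allow (section 6.2.2) can be sketched as follows: hex digits are uppercased and only percent-encoded unreserved characters are decoded, so reserved characters such as ; and = are never touched in either direction. normalizePercentEncoding is a hypothetical helper written for this example, not something normalize-url provides.

```javascript
// Hypothetical helper, not part of normalize-url.
// Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;

function normalizePercentEncoding(input) {
  return input.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const char = String.fromCharCode(parseInt(hex, 16));
    // Decode only unreserved characters; keep everything else encoded,
    // with the hex digits uppercased.
    return UNRESERVED.test(char) ? char : `%${hex.toUpperCase()}`;
  });
}

// A literal ";" is left alone, and an encoded one stays encoded.
console.log(normalizePercentEncoding('Action=AgentTicketZoom;TicketID=707128'));
//=> Action=AgentTicketZoom;TicketID=707128
console.log(normalizePercentEncoding('Action=AgentTicketZoom%3bTicketID%3d707128'));
//=> Action=AgentTicketZoom%3BTicketID%3D707128

// Encoded unreserved characters are decoded.
console.log(normalizePercentEncoding('%41%7e'));
//=> A~
```

Under this scheme the two query strings from the report remain distinct after normalization, which matches the RFC's statement that they are not equivalent.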

Incidentally,

const a = 'https://my.otrs.dom/index.pl?Action=AgentTicketZoom;TicketID=707128';
const b = 'https://my.otrs.dom/index.pl?Action=AgentTicketZoom%3BTicketID%3D707128';

new URL(a).search === new URL(b).search
//=> false

Alright, so the URI spec (RFC 3986) is being superseded by the WHATWG URL spec, whose URLSearchParams API uses the application/x-www-form-urlencoded format for the query string, and that format doesn't seem to care about the reserved characters in URIs. Wow.
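A small sketch of where the re-encoding can come from, using only the standard URL and URLSearchParams APIs. Whether normalize-url takes exactly this path is an assumption here; the snippet only shows that serializing or mutating the parsed parameters is enough to produce the reported output.

```javascript
// Sketch: the application/x-www-form-urlencoded serializer re-encodes ";" and "=".
const original = 'https://my.otrs.dom/index.pl?Action=AgentTicketZoom;TicketID=707128';

// Serializing the parsed parameters percent-encodes ";" and the second "=".
const serialized = new URL(original).searchParams.toString();
console.log(serialized);
//=> Action=AgentTicketZoom%3BTicketID%3D707128

// A mutation through searchParams writes the serialized form back to the URL.
// sort() is used as an example of such a mutation (assumption: a normalizer
// that sorts parameters this way would trigger the same rewrite).
const sorted = new URL(original);
sorted.searchParams.sort();
console.log(sorted.search);
//=> ?Action=AgentTicketZoom%3BTicketID%3D707128
```

Reading url.search without touching searchParams leaves the original query string intact, which is why the earlier comparison of the two search values returns false.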