smola/galimatias

Optionally normalize empty path segments ("//" and trailing slash in path)


Multiple slashes

Multiple consecutive slashes are allowed by the standards and mean something different from a single slash. That is: "/foo/bar" should be translated to the path segments "foo" and "bar", while "/foo//bar" is "foo", "" and "bar".
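To make the segment view concrete, here is a minimal sketch in Python. The helper `path_segments` is a hypothetical name used only for illustration and is not part of Galimatias; it assumes an absolute path and does no percent-decoding.

```python
def path_segments(path: str) -> list[str]:
    """Split an absolute URL path into segments, keeping empty ones."""
    # Drop the single leading slash that introduces the first segment.
    if path.startswith("/"):
        path = path[1:]
    # Every remaining slash separates two segments, so consecutive
    # slashes yield empty-string segments.
    return path.split("/")


print(path_segments("/foo/bar"))   # ['foo', 'bar']
print(path_segments("/foo//bar"))  # ['foo', '', 'bar']
```

Under the same view, a trailing slash is just an empty last segment: "/foo/" gives "foo" and "".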

Some people use significant empty segments in their paths (see this). However, in the most common case multiple slashes are not significant and are an unintended consequence of bad serialization.

Trailing slash

It's generally accepted that a trailing slash can be added to a URL path if there is no "file extension" (e.g. /foo -> /foo/, but not /foo.html -> /foo.html/). However, adding it changes the semantics of the URL according to RFC 3986 and might break well-formed URLs in lots of cases.
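One place where the change in semantics shows up is relative reference resolution. The sketch below uses Python's standard `urllib.parse.urljoin`; the host `example.com` is a placeholder chosen for illustration.

```python
from urllib.parse import urljoin

# Without a trailing slash, "foo" is the last segment and gets replaced.
without_slash = urljoin("http://example.com/foo", "bar")

# With a trailing slash, "foo" acts as a directory and is kept.
with_slash = urljoin("http://example.com/foo/", "bar")

print(without_slash)  # http://example.com/bar
print(with_slash)     # http://example.com/foo/bar
```

So a page served at /foo and at /foo/ resolves its relative links differently, even if the server returns the same content for both.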

Further considerations

Both of these normalizations can break standards-compliant URLs, so they should be optional and the user should be warned. When the normalization is performed (during parsing or after parsing) also matters, since it can change the result of resolving /../.
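The ordering problem can be illustrated with a small sketch. Both helpers below are hypothetical and simplified: `remove_dot_segments` only approximates the RFC 3986 algorithm for absolute paths, and `collapse_slashes` stands in for the proposed normalization. Neither is Galimatias code.

```python
import re


def collapse_slashes(path: str) -> str:
    """Replace every run of slashes with a single slash."""
    return re.sub(r"/{2,}", "/", path)


def remove_dot_segments(path: str) -> str:
    """Simplified dot-segment removal for absolute paths only."""
    segments = path.split("/")[1:]
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg == ".":
            if is_last:
                out.append("")  # keep the trailing slash
        elif seg == "..":
            if out:
                out.pop()  # ".." removes the previous segment, even an empty one
            if is_last:
                out.append("")
        else:
            out.append(seg)
    return "/" + "/".join(out)


path = "/a//../b"
collapse_first = remove_dot_segments(collapse_slashes(path))
collapse_last = collapse_slashes(remove_dot_segments(path))

print(collapse_first)  # /b
print(collapse_last)   # /a/b
```

In "/a//../b" the ".." consumes the empty segment, so collapsing the slashes first makes it consume "a" instead and the two orders produce different paths.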

Proper processing of these cases (as Google seems to be doing) is normalizing according to the result of fetching the URL and processing redirects and <link rel="canonical">.

Because of all of this, I still doubt that providing these normalizations in Galimatias is a sane choice.