to_iri() does not decode stray percents

Question

agoose77 opened this issue 7 years ago · 4 comments

According to the specification, stray percent signs in URIs should be encoded as %25.

However, when calling to_iri(), these encoded percents are not decoded.
Is this a bug?

Answer 1 · 2018-07-07T17:25:49.000Z

This is a case where "should" is an especially loaded term. When the RFC says stray percents "should" be encoded, it's not saying whether a library should do that, or the application.

As a developer you probably want to produce normalized URLs for well-behaved Internet applications. Other use cases may require dealing with some dirty data.

Because hyperlink makes some attempt at roundtrippability and backwards-compatibility, the stray percent encoding has been made available in the .normalize() method, which I heartily recommend calling :)

To demonstrate:

>>> import hyperlink as hl
>>> hl.parse(u'http://example.com/lol%lmao').to_text()
u'http://example.com/lol%lmao'
>>> hl.parse(u'http://example.com/lol%lmao').normalize().to_text()
u'http://example.com/lol%25lmao'
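
For anyone wondering what counts as "stray" here, this is a rough sketch of the rule, not hyperlink's actual implementation. The helper name encode_stray_percents and the regular expression are made up for illustration, and only the standard library is used:

```python
import re

# Assumption for this sketch: a percent sign is "stray" when it is not
# followed by exactly two hex digits.
_STRAY_PERCENT = re.compile(r'%(?![0-9A-Fa-f]{2})')


def encode_stray_percents(text):
    """Replace each stray '%' with '%25', leaving valid escapes untouched."""
    return _STRAY_PERCENT.sub('%25', text)


print(encode_stray_percents('http://example.com/lol%lmao'))
# -> http://example.com/lol%25lmao
print(encode_stray_percents('http://example.com/beyonc%C3%A9'))
# -> http://example.com/beyonc%C3%A9 (already a valid escape, unchanged)
```

Running the helper on its own output changes nothing, since %25 is itself a valid escape.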

It's a definite balancing act, and I try to keep the level of surprise in check. So, if you find a particularly surprising case, just include some runnable code like above, and we'll take it from there.

Answer 2 · 2018-07-07T18:13:38.000Z

I agree that the line between the library's job and the application's is ill-defined; then again, if everything were pushed onto the application, we'd only have a URI parser, which is less useful. However, in the to_iri() method, these %25 sequences should surely be unescaped like the other percent-encoded fields, which currently doesn't happen.

Answer 3 · 2018-07-07T18:49:30.000Z

Ahh, I was distracted by the stray percents comment; now I see the question. Basically, in this example:

>>> import hyperlink as hl
>>> hl.parse(u'http://example.com/lol%lmao').normalize().to_iri()
URL.from_text(u'http://example.com/lol%25lmao')

The question is: "Why isn't %25 turned back into a %?"

The simple answer is that % is a reserved character in the path, and reserved characters as a rule are not decoded in parts where they are reserved. Other percent-encoded characters are most definitely decoded with to_iri():

>>> print(hl.parse(u'http://example.com/beyonc%C3%A9').normalize().to_iri().to_text())
http://example.com/beyoncé

Add an extra % and hyperlink even decodes around the stray percent:

>>> print(hl.parse(u'http://example.com/beyonc%%C3%A9').normalize().to_iri().to_text())
http://example.com/beyonc%25é

Another way of looking at it is that hyperlink should never produce a "stray" percent, even when it's obviously stray (because the characters that follow are not hex).
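
To make the "decode everything except the percent" idea concrete, here is a minimal sketch using only the standard library. It is not how hyperlink implements to_iri(); the function name decode_except_percent is hypothetical, and for brevity it protects only % itself, whereas the rule above also keeps other reserved characters encoded in the parts where they are reserved:

```python
import re

# One or more consecutive %XX escapes, e.g. '%C3%A9'.
_ESCAPE_RUN = re.compile(r'(?:%[0-9A-Fa-f]{2})+')


def decode_except_percent(text):
    """Decode %XX escapes as UTF-8, but keep an encoded '%' as '%25'."""
    def _decode_run(match):
        raw = bytes.fromhex(match.group(0).replace('%', ''))
        try:
            decoded = raw.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the escapes exactly as they were.
            return match.group(0)
        # Re-encode any literal '%' so the result never has a stray one.
        return decoded.replace('%', '%25')
    return _ESCAPE_RUN.sub(_decode_run, text)


print(decode_except_percent('http://example.com/beyonc%25%C3%A9'))
# -> http://example.com/beyonc%25é
```

The output still round-trips: decoding it a second time gives the same text, because the only escape left is %25.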

Does that make sense? Maybe we should add a new doc titled "URLs by Example", because I think it's really tricky to learn this next-level encoding any other way.

Answer 4 · 2018-07-07T23:42:49.000Z

> Another way of looking at it is that hyperlink should never produce a "stray" percent, even when it's obviously stray (because the characters that follow are not hex).

This makes sense. It did cause problems for me, but it was more a case of needing to use DecodedURL to decode the URI fragment aggressively, so I think we can close this.
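
For later readers, "aggressively" here means decoding every escape in the fragment, %25 included, which gives a value meant for display or application logic rather than for re-serializing. The sketch below shows that idea with urllib.parse from the standard library as a stand-in, not with DecodedURL itself; the helper name decoded_fragment and the example URL are made up:

```python
from urllib.parse import unquote, urlsplit


def decoded_fragment(url_text):
    """Return the fragment with every %XX escape decoded, including %25."""
    # urlsplit leaves the fragment percent-encoded; unquote decodes it fully.
    return unquote(urlsplit(url_text).fragment)


print(decoded_fragment('http://example.com/page#100%25%20done'))
# -> 100% done
```

The result can contain a bare %, so it should not be pasted back into a URL without encoding it again.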