to_iri() does not decode stray percents
agoose77 opened this issue · 4 comments
From the specification, URIs should encode stray percents with %25.
However, when calling to_iri(), these encoded percents are not decoded.
Is this a bug?
This is a case where "should" is an especially loaded term. When the RFC says stray percents "should" be encoded, it doesn't say whether that is the library's job or the application's.
As a developer you probably want to produce normalized URLs for well-behaved Internet applications. Other use cases may require dealing with some dirty data.
Because hyperlink makes some attempt at roundtrippability and backwards-compatibility, the stray percent encoding has been made available in the .normalize() method, which I heartily recommend calling :)
To demonstrate:
>>> import hyperlink as hl
>>> hl.parse(u'http://example.com/lol%lmao').to_text()
u'http://example.com/lol%lmao'
>>> hl.parse(u'http://example.com/lol%lmao').normalize().to_text()
u'http://example.com/lol%25lmao'

It's a definite balancing act, and I try to keep the level of surprise in check. So, if you find a particularly surprising case, just include some runnable code like above, and we'll take it from there.
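For anyone curious what "encoding stray percents" amounts to, here is a minimal standard-library sketch. It is not hyperlink's implementation: `encode_stray_percents` is a hypothetical helper, and it assumes a percent counts as stray whenever it is not followed by two hex digits.

```python
import re

# Assumption: a '%' is "stray" when it is NOT followed by two hex digits.
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def encode_stray_percents(text):
    """Hypothetical helper: replace each stray '%' with '%25'."""
    return _STRAY_PERCENT.sub("%25", text)


print(encode_stray_percents("http://example.com/lol%lmao"))
# http://example.com/lol%25lmao

print(encode_stray_percents("http://example.com/beyonc%%C3%A9"))
# http://example.com/beyonc%25%C3%A9
```

Running the helper a second time changes nothing, because `%25` is itself a well-formed escape.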
I agree that what is user vs application is ill-defined, or rather, if everything was pushed onto the user, we'd only have a URI parser, which is less useful. However, in the to_iri() method, these %25 should surely be unescaped like the other percent-encoded fields, which currently doesn't happen.
Ahh, I was distracted by the stray percents comment, now I see the question. Basically in this example:
>>> import hyperlink as hl
>>> hl.parse(u'http://example.com/lol%lmao').normalize().to_iri()
URL.from_text(u'http://example.com/lol%25lmao')

The question is: "Why isn't %25 turned back into a %?"
The simple answer is that % is a reserved character in the path, and reserved characters as a rule are not decoded in parts where they are reserved. Other percent-encoded characters are most definitely decoded with to_iri():
>>> print(hl.parse(u'http://example.com/beyonc%C3%A9').normalize().to_iri().to_text())
http://example.com/beyoncé

Add an extra % and hyperlink even decodes around the stray percent:
>>> print(hl.parse(u'http://example.com/beyonc%%C3%A9').normalize().to_iri().to_text())
http://example.com/beyonc%25é

Another way of looking at it is that hyperlink should never produce a "stray" percent, even when it's obviously stray (because the characters that follow are not hex).
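To make the round-trip concern concrete, here is a rough sketch using only `urllib.parse.unquote` from the standard library. It does not reproduce hyperlink's internals: `decode_keeping_percent` is a hypothetical helper that assumes the input is already normalized, and it simply decodes every escape except `%25`.

```python
from urllib.parse import unquote


def decode_keeping_percent(text):
    """Hypothetical helper: decode all percent-escapes except '%25'.

    Assumes `text` is already normalized (no stray percents).
    """
    return "%25".join(unquote(piece) for piece in text.split("%25"))


normalized = "http://example.com/beyonc%25%C3%A9"
print(decode_keeping_percent(normalized))
# http://example.com/beyonc%25é

# Decoding %25 as well makes the result ambiguous on the next pass:
ambiguous = "http://example.com/100%2541"
print(unquote(ambiguous))                 # http://example.com/100%41
print(unquote(unquote(ambiguous)))        # http://example.com/100A
print(decode_keeping_percent(ambiguous))  # http://example.com/100%2541
```

Once `%25` has been decoded to a bare `%`, a later reader cannot tell a literal percent from the start of an escape, which is one way to see why leaving it encoded keeps the text safe to parse again.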
Does that make sense? Maybe we should add a new doc titled "URLs by Example", because I think it's really tricky to learn this next-level encoding any other way.
"Another way of looking at it is that hyperlink should never produce a "stray" percent, even when it's obviously stray (because the characters that follow are not hex)."
This makes sense. It did cause problems for me, but it was more a case of needing to use DecodedURL to decode the URI fragment aggressively, so I think we can close this.