twingly/twingly-url

Feature: Get URL (and parts of it) in ASCII

dentarg opened this issue · 11 comments

Related to https://github.com/twingly/klondike/issues/31

It would be useful to expose #normalized_host from Addressable, because Ruby DNS libraries (stdlib Resolv, alexdalitz/dnsruby#94) can't handle IDN at all.

[18] pry(main)> Dnsruby::DNS.new.getaddress(Addressable::URI.heuristic_parse("räksmörgås.josefßon.org").normalized_host)
=> #<Dnsruby::IPv4 155.4.17.102>

[19] pry(main)> Resolv.getaddress(Addressable::URI.heuristic_parse("räksmörgås.josefßon.org").normalized_host)
=> "155.4.17.102"

Also, our own #normalized_host isn't suitable for this (it can break URLs). (The terminology here is unfortunate...)

walro commented

Don't we really want a "to punycode" method somewhere?

jage commented

Don't we really want a "to punycode" method somewhere?

That would be better. I'm guessing it would be nice to get both the host and all strings that contain the host in both ASCII and UTF-8.

I'm not really following...

Is something like this what we mean?

Twingly::URL.parse("http://räksmörgås.josefßon.org/foobar").to_punycode
# => "http://xn--rksmrgs-5wao1o.josefsson.org/foobar"

How should we implement it? Should we use http://www.rubydoc.info/gems/addressable/Addressable/URI#normalized_host-instance_method or not? In my mind that's the most straightforward and lowest-cost thing to do.

Please elaborate on your thoughts! :)

I haven't read everything at https://en.wikipedia.org/wiki/Punycode, but in my mind we only care about Punycode in the context of DNS, the host that is.

Maybe we want a method called punycoded_host?
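
As a rough sketch of what a punycoded_host could look like without depending on Addressable, the following implements the Punycode encoder from RFC 3492 in plain Ruby and applies it per label. The names punycode_encode and punycoded_host are hypothetical here, not existing twingly-url or Addressable methods. It assumes the host is already lowercased and does no Nameprep/IDNA mapping, so unlike the expected output above it would not turn "ß" into "ss".

```ruby
# Punycode parameters from RFC 3492, section 5.
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700
INITIAL_BIAS = 72
INITIAL_N = 128

# Bias adaptation function (RFC 3492, section 6.1).
def adapt(delta, numpoints, firsttime)
  delta = firsttime ? delta / DAMP : delta / 2
  delta += delta / numpoints
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end
  k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
end

# Digits 0..25 map to a..z, digits 26..35 map to 0..9.
def encode_digit(d)
  d < 26 ? (d + 97).chr : (d - 26 + 48).chr
end

# Encode one label (hypothetical helper, no "xn--" prefix added here).
def punycode_encode(label)
  codepoints = label.codepoints
  n = INITIAL_N
  delta = 0
  bias = INITIAL_BIAS
  basic = codepoints.select { |c| c < 128 }
  output = basic.pack("U*")
  h = b = basic.length
  output << "-" if b > 0

  while h < codepoints.length
    # Smallest code point not yet handled.
    m = codepoints.select { |c| c >= n }.min
    delta += (m - n) * (h + 1)
    n = m
    codepoints.each do |c|
      delta += 1 if c < n
      next unless c == n

      # Emit delta as a variable-length integer.
      q = delta
      k = BASE
      loop do
        t = if k <= bias then TMIN
            elsif k >= bias + TMAX then TMAX
            else k - bias
            end
        break if q < t
        output << encode_digit(t + (q - t) % (BASE - t))
        q = (q - t) / (BASE - t)
        k += BASE
      end
      output << encode_digit(q)
      bias = adapt(delta, h + 1, h == b)
      delta = 0
      h += 1
    end
    delta += 1
    n += 1
  end
  output
end

# Hypothetical punycoded_host: ASCII labels pass through unchanged,
# other labels are encoded and prefixed with "xn--".
def punycoded_host(host)
  host.split(".").map { |label|
    label.ascii_only? ? label : "xn--" + punycode_encode(label)
  }.join(".")
end

punycoded_host("räksmörgås.example.org")
# => "xn--rksmrgs-5wao1o.example.org"
```

In practice Addressable's #normalized_host is probably the cheaper route, since it also takes care of the mapping step that this sketch skips.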

jage commented

Is something like this what we mean?

Yes.

I haven't read everything at https://en.wikipedia.org/wiki/Punycode, but in my mind we only care about Punycode in the context of DNS, the host that is.

DNS is a part of HTTP.

How should we implement it? Should we use http://www.rubydoc.info/gems/addressable/Addressable/URI#normalized_host-instance_method or not? In my mind that's the most straightforward and lowest-cost thing to do.

Not until we've looked at alternatives.

If you need this feature now, just use Addressable explicitly in your code.

The title of this issue is now less opinionated.

The punycoded TLD would also be nice to have when dealing with Internationalized ccTLDs.

I'm merging #72 into this issue; it's the same thing.

In one project we have this:

connection = Faraday.new do |faraday|
  faraday.use FaradayMiddleware::FollowRedirects
  faraday.adapter :excon
end

escaped_url = Twingly::URL.parse(url).normalized.to_s

connection.head(escaped_url)

Not sure we should do escaping exactly like this, but it should be a part of twingly-url IMHO.

Not sure we should do escaping exactly like this

Yeah, normalizing != escaping
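
To make the difference concrete: escaping only rewrites characters that can't go on the wire as-is, while normalizing may also change things like case or default ports. A minimal sketch of the escaping half, using a hypothetical helper escape_non_ascii (not part of twingly-url or Addressable) that percent-encodes the UTF-8 bytes of every non-ASCII character and leaves everything else untouched. It is meant for the path and query only; the host would need Punycode instead.

```ruby
# Hypothetical helper: percent-encode each non-ASCII character as its
# UTF-8 bytes, leaving all ASCII characters (including case) as they are.
def escape_non_ascii(str)
  str.each_char.map { |char|
    if char.ascii_only?
      char
    else
      char.bytes.map { |byte| format("%%%02X", byte) }.join
    end
  }.join
end

escape_non_ascii("/räksmörgås")
# => "/r%C3%A4ksm%C3%B6rg%C3%A5s"

escape_non_ascii("/FooBar?a=1")
# => "/FooBar?a=1"
```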

#71 could be expanded to cover the whole URL, and then that could be used instead of #normalized in code such as the above.

https://url.spec.whatwg.org/

Heh, I see that Pinboard says "previously saved October 2015" about the above URL, and the page now says "Last Updated 25 October 2018". It sure takes some time to compile a solid standard.