twingly/twingly-url

NormalizedURL#to_s returns punycode, other instance methods do not

Closed this issue · 13 comments

Calling #to_s or #host on a normalized URL returns a punycoded string.

Calling #domain, #tld, #sld etc. on the normalized URL returns the non-punycoded version. What those methods have in common is that they all use public_suffix.

[3] pry(main)> url = Twingly::URL.parse("http://test.भारत")
=> #<Twingly::URL:0x3fe40956dc7c http://test.भारत>
[4] pry(main)> normalized_url = url.normalized
=> #<Twingly::URL:0x3fe40acdbfa8 http://www.test.xn--h2brj9c/>
[5] pry(main)> url.to_s
=> "http://test.भारत"
[6] pry(main)> normalized_url.to_s
=> "http://www.test.xn--h2brj9c/"
[7] pry(main)> normalized_url.tld
=> "भारत"
[8] pry(main)> normalized_url.domain
=> "test.भारत"

I think we need to do #68 before this can be solved, so we can check whether the strings that public_suffix_domain returns (in these methods) should be converted to ascii.

Hmm, why?

Hmm, why?

    def trd
      url_trd = public_suffix_domain.trd.to_s
      url_trd = convert_to_ascii(url_trd) if normalized?
      url_trd
    end

I see, but that's not how we are going to do it :) I think we should parse/convert only one time (when #normalized is called), save the values for later, and return them when they are asked for.

    def trd
      @normalized_trd || @trd
    end
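A minimal sketch of that "convert once, save, and return" idea, for illustration only. `SketchURL`, its `conversions` counter and the injected `to_ascii` callable are hypothetical names and not part of twingly-url; the lookup table contains only the Unicode/punycode pair shown in the pry session above.

```ruby
# Hypothetical value object; this is NOT the Twingly::URL implementation.
class SketchURL
  attr_reader :conversions

  # `to_ascii` is any callable that maps a Unicode label to its ASCII form.
  def initialize(trd:, tld:, to_ascii:)
    @trd = trd
    @tld = tld
    @to_ascii = to_ascii
    @conversions = 0
  end

  # Convert every part exactly once and remember the results on the copy.
  def normalized
    copy = dup
    copy.send(:convert_parts!)
    copy
  end

  # Readers only return stored values; they never convert again.
  def trd
    @normalized_trd || @trd
  end

  def tld
    @normalized_tld || @tld
  end

  private

  def convert_parts!
    @normalized_trd = convert(@trd)
    @normalized_tld = convert(@tld)
  end

  def convert(label)
    @conversions += 1
    @to_ascii.call(label)
  end
end

# Assumption: a fixed table stands in for a real IDNA conversion.
SKETCH_TABLE = { "भारत" => "xn--h2brj9c" }.freeze
sketch_to_ascii = ->(label) { SKETCH_TABLE.fetch(label, label) }

url = SketchURL.new(trd: "www", tld: "भारत", to_ascii: sketch_to_ascii)
normalized = url.normalized

puts url.tld                 # original form is untouched
puts normalized.tld          # stored ASCII form
puts normalized.conversions  # stays the same however often the readers are called
```

The trade-off is that each object carries both states, which is what the later comments in this thread argue against.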

I think we should expand #71 to include not only host, but all parts that can exist in the two formats

I see, but that's not how we are going to do it :) I think we should parse/convert only one time (when #normalized is called), save the values for later, and return them when they are asked for.

Ah, yes, that's a much better idea. Just ignore my comments then 😄

Sorry, I talked a little with @walro; he thinks we should make the #normalized method work, rather than keep both the non-normalized and the normalized state around.

    def normalized
      normalized_url = addressable_uri.dup

      normalized_url.scheme = normalized_scheme
      normalized_url.host   = normalized_host
      normalized_url.path   = normalized_path

      self.class.parse(normalized_url)
    end

Above we have only considered Addressable. I think that, when it comes to TLDs, Addressable doesn't do what we expect it to do... you can experiment with this if you want, or we can take a look together later. Gotta go now :)

Just realized that this doesn't only concern normalized URLs. What do we want to do in this case:

[1] pry(main)> u = Twingly::URL.parse("http://teståäö.xn--3e0b707e")
=> #<Twingly::URL:0x3fdd006c6b10 http://teståäö.xn--3e0b707e>
[2] pry(main)> u.to_s
=> "http://teståäö.xn--3e0b707e"
[3] pry(main)> u.tld
=> "한국"

I think the most logical thing would be to return the tld in its original form ("xn--3e0b707e", the same as .to_s returns).

I think the most logical thing would be to return the tld in its original form ("xn--3e0b707e", the same as .to_s returns).

How will we know what the original tld looks like if we can only extract the tld by using public_suffix, which only works with non-punycoded domains (because the suffix list doesn't contain punycoded domains)?

I think I give up on this for today 😄

Maybe we need to create the reverse public suffix list

On 8 sep. 2016, at 16:36, Mattias Roback notifications@github.com wrote:

I think the most logical thing would be to return the tld in its original form ("xn--3e0b707e", the same as .to_s returns).

How will we know what the original tld looks like if we can only extract the tld by using public_suffix, which only works with non-punycoded domains (because the suffix list doesn't contain punycoded domains)?

I think I give up on this for today 😄



On the plane ride home I played around with twingly-url and public suffix 2, and I have the reverse list now :)

Maybe we need to create the reverse public suffix list

public_suffix has an add method that you can use to insert new rules into the list. Maybe that can be used somehow. I'll have to try and see if I can make something work :)

Yes, that's what I used
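A rough sketch of what a "reverse" list could look like: every suffix is registered in both its Unicode and its punycoded form, so a lookup returns the TLD in whatever form the host was given. `TinySuffixList` is a hypothetical stand-in written for this example and is not the public_suffix gem's API; the two suffix pairs are the ones that appear in the pry sessions in this thread, and wildcard and exception rules are ignored.

```ruby
# Hypothetical stand-in for a suffix list; NOT the public_suffix gem.
class TinySuffixList
  def initialize
    @suffixes = []
  end

  # Register one plain suffix rule (no wildcard or exception rules here).
  def add(suffix)
    @suffixes << suffix unless @suffixes.include?(suffix)
    self
  end

  # Return the longest registered suffix the host ends with, or nil.
  def tld_for(host)
    labels = host.split(".")
    # Candidates from longest to shortest: "a.b.c", "b.c", "c".
    candidates = (0...labels.length).map { |i| labels[i..-1].join(".") }
    candidates.find { |candidate| @suffixes.include?(candidate) }
  end
end

# Unicode => punycode pairs taken from the examples in this thread.
REVERSE_PAIRS = {
  "भारत" => "xn--h2brj9c",
  "한국" => "xn--3e0b707e"
}.freeze

list = TinySuffixList.new
REVERSE_PAIRS.each do |unicode, ascii|
  list.add(unicode) # the form a Unicode suffix list already contains
  list.add(ascii)   # the "reverse" entry for punycoded hosts
end

puts list.tld_for("test.भारत")
puts list.tld_for("test.xn--h2brj9c")
puts list.tld_for("teståäö.xn--3e0b707e")
```

With both forms present, the lookup never has to convert the host first, so the returned TLD matches what .to_s shows.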

I think the most logical thing would be to return the tld in its original form ("xn--3e0b707e", the same as .to_s returns).

This is the case with the changes made in #90: we no longer pass Addressable's #display_uri.host (which always converted punycoded names (xn--) to Unicode) to public_suffix.

If we merge #90, we can close this issue by adding more comprehensive tests for our normalize methods, e.g. normalizing some different types of URLs and expecting correct output for each part and for the whole URL.