NormalizedURL#to_s returns punycode, other instance methods does not
Closed this issue · 13 comments
Calling #.to_s
or #host
on a normalized URL returns a punycoded string.
Calling #domain
, #tld
, #sld
etc. on the normalized url returns the non-punycoded version of the URL. What those methods have in common is that they all uses public_suffix.
[3] pry(main)> url = Twingly::URL.parse("http://test.भारत")
=> #<Twingly::URL:0x3fe40956dc7c http://test.भारत>
[4] pry(main)> normalized_url = url.normalized
=> #<Twingly::URL:0x3fe40acdbfa8 http://www.test.xn--h2brj9c/>
[5] pry(main)> url.to_s
=> "http://test.भारत"
[6] pry(main)> normalized_url.to_s
=> "http://www.test.xn--h2brj9c/"
[7] pry(main)> normalized_url.tld
=> "भारत"
[8] pry(main)> normalized_url.domain
=> "test.भारत"
I think we need to do #68 before this can be solved, so we can check whether the strings that public_suffix_domain
returns (in these methods) should be converted to ascii.
Hmm, why?
Hmm, why?
def trd
url_trd = public_suffix_domain.trd.to_s
url_trd = convert_to_ascii(url_trd) if normalized?
url_trd
end
I see, but that's not how we are going to do it :) I think we should parse/convert only one time (when #normalized
is called), save the values for later, and return them when they are asked for.
def trd
@normalized_trd || @trd
end
I think we should expand #71 to include not only host, but all parts that can exist in the two formats
I see, but that's not have we are going to do it :) I think we should parse/convert only one time (when #normalized is called), save the values for later, and return them when they are asked for.
Ah, yes, thats a much better idea. Just ignore my comments then 😄
Sorry, talked a little bit with @walro, he thinks we should make the #normalized
method work, not keeping both states of both non-normalized and normalized
def normalized
normalized_url = addressable_uri.dup
normalized_url.scheme = normalized_scheme
normalized_url.host = normalized_host
normalized_url.path = normalized_path
self.class.parse(normalized_url)
end
Above we have only thought of addressable, I think addressable when it comes to TLDs doesn't do what we expect it to do... you can experiment with this if you want, or we can take a look together later. Gotta go now :)
Just realized that this doesn't just have to do with normalized URLs. How do we want to do in this case:
[1] pry(main)> u = Twingly::URL.parse("http://teståäö.xn--3e0b707e")
=> #<Twingly::URL:0x3fdd006c6b10 http://teståäö.xn--3e0b707e>
[2] pry(main)> u.to_s
=> "http://teståäö.xn--3e0b707e"
[3] pry(main)> u.tld
=> "한국"
I think the most logical thing would be to return the tld
in it's original form ("xn--3e0b707e", the same as .to_s
returns).
I think the most logical thing would be to return the tld in it's original form ("xn--3e0b707e", the same as .to_s returns).
How will we know what the original tld looks like if we can only extract the tld by using public_suffix, which only works with non-punycoded domains (because the suffix list doesn't contain punycoded domains)?
I think I give up on this for today 😄
Maybe we need to create the reverse public suffix list
On 8 sep. 2016, at 16:36, Mattias Roback notifications@github.com wrote:
I think the most logical thing would be to return the tld in it's original form ("xn--3e0b707e", the same as .to_s returns).
How will we know what the original tld looks like if we can only extract the tld by using public_suffix, which only works with non-punycoded domains (because the suffix list doesn't contain punycoded domains)?
I think I give up on this for today 😄
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
On the plane ride home I played around with twingly-url and public suffix 2, and I have the reverse list now :)
Maybe we need to create the reverse public suffix list
public_suffix has an add method that you can use to insert new rules into the list. Maby that can be used somehow. I'll have to try and see if I can make something work :)
Yes, that's what I used
I think the most logical thing would be to return the tld in it's original form ("xn--3e0b707e", the same as .to_s returns).
This is the case with the changes made in #90 – we no longer pass Addressable #display_uri.host
(which always converted punycoded names (xn--`)) to public suffix.
If we merge #90, we can close this issue by adding more comphrensive tests for our normalize methods, e.g. normalizing some different types of URLs and expecting correct output for each part and the whole URL.