Support for tld
Closed this issue · 8 comments
Hi,
I'm looking for a utility to fetch the TLD of a URL.
E.g. for url = https://www.google.co.uk, the TLD should be uk and the SLD should be co.uk. Is there a way to achieve this?
Hi @dkjhanitt,
I think this is a good candidate for an enhancement to scala-uri. I have added a first attempt and published it as version 0.4.12-SNAPSHOT for you to try out.
It works by using the list of public suffixes from publicsuffix.org as there is no way to do this algorithmically.
Let me know if 0.4.12-SNAPSHOT works for you. There are a few things I'd like to do before cutting a proper release:
- Run some micro benchmarks. The current implementation uses a Trie. Is this the most efficient approach? I'd like to compare against a simple Set of all the public suffixes, calling contains with the last dot-separated segment of the hostname, then repeating with each next dot-separated segment prepended until it returns false.
- If a Trie is the best approach, are there any identical parts of the tree that can be reused? The current Trie is 701kb encoded as JSON!
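As a rough illustration of the Set-based alternative described in the first bullet, here is a minimal sketch. It is not scala-uri's code: the object name `SetSuffixLookup`, the method name and the three-entry sample set are all made up for the example, and it ignores the wildcard and exception rules that the real publicsuffix.org list contains.

```scala
object SetSuffixLookup {
  // Hypothetical hand-picked sample; the real list has thousands of entries.
  val suffixes: Set[String] = Set("com", "uk", "co.uk")

  /** Longest known suffix of `host`, found by growing the candidate
    * one dot-separated label at a time, starting from the right.
    */
  def publicSuffix(host: String): Option[String] = {
    val labels = host.toLowerCase.split('.').toList.reverse
    // For "www.google.co.uk": "uk", "co.uk", "google.co.uk", "www.google.co.uk"
    val candidates = labels
      .scanLeft(List.empty[String])((acc, label) => label :: acc)
      .tail
      .map(_.mkString("."))
    // Stop at the first candidate the Set does not contain.
    candidates.takeWhile(suffixes.contains).lastOption
  }
}
```

With this sample set, `SetSuffixLookup.publicSuffix("www.google.co.uk")` returns `Some("co.uk")`. Each step builds and hashes a new string, which is one reason it would be worth benchmarking against the Trie.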
To use the SNAPSHOT version, you will probably have to add a resolver to your SBT build. See: https://github.com/NET-A-PORTER/scala-uri#latest-snapshot-builds
For usage see here: http://github.com/NET-A-PORTER/scala-uri#public-suffixes
Hi @theon ,
Thanks for the response. I tried to use the "com.netaporter" %% "scala-uri" % "0.4.12-SNAPSHOT" version, but I'm running into the following error:
Exception in thread "main" java.io.FileNotFoundException: src/main/resources/public_suffix_trie.json (No such file or directory)
Alternatively, I came across Google Guava, which also uses the data from publicsuffix.org.
build.sbt dependency:
"com.google.guava" % "guava" % "16.0",
import com.google.common.net.InternetDomainName
val url = "mail.google.com"
val url1 = "mail.google.co.uk"
val id1 = InternetDomainName.from(url1)
val id = InternetDomainName.from(url)
println(id.topPrivateDomain(), id.parts(), id.publicSuffix())
println(id1.topPrivateDomain(), id1.parts(), id1.publicSuffix())
Which prints
(google.com, [mail, google, com], com)
(google.co.uk, [mail, google, co, uk], co.uk)
Hi @dkjhanitt,
Thanks for getting back. The FileNotFoundException should be fixed now for 0.4.12-SNAPSHOT, sorry about that. I'll comment on the PR over there.
I ran some benchmarks and am happy with the runtime characteristics. The ScalaMeter tests come out at about 10 nanoseconds for a uri.publicSuffix call. A crappy homemade benchmark comes out at 0 nanoseconds, probably because the call takes less than the resolution of System.nanoTime. Calling uri.publicSuffix to get a .com suffix should result in five .get() calls to five small maps (about 0-30 items), so I guess 10 nanoseconds sounds about right?
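To make the "a few .get() calls on small maps" reasoning concrete, here is a minimal sketch of a character-level Trie lookup that walks the hostname from its last character. This is an assumption about the general shape only, not the actual scala-uri implementation: `TrieSuffixLookup`, `Node`, `build` and `publicSuffix` are hypothetical names, and wildcard and exception rules are left out.

```scala
object TrieSuffixLookup {
  // One node per character; `terminal` marks the end of a complete suffix.
  final case class Node(children: Map[Char, Node] = Map.empty, terminal: Boolean = false) {
    def insert(chars: List[Char]): Node = chars match {
      case Nil => copy(terminal = true)
      case c :: rest =>
        val child = children.getOrElse(c, Node())
        copy(children = children.updated(c, child.insert(rest)))
    }
  }

  // Suffixes are stored reversed so lookup can start at the end of the host.
  def build(suffixes: Seq[String]): Node =
    suffixes.foldLeft(Node())((trie, s) => trie.insert(s.reverse.toList))

  def publicSuffix(trie: Node, host: String): Option[String] = {
    val lower = host.toLowerCase
    val reversed = lower.reverse

    @annotation.tailrec
    def walk(node: Node, i: Int, best: Option[Int]): Option[Int] = {
      // A match only counts at a label boundary: end of host or just before a dot.
      val atBoundary = i == reversed.length || reversed(i) == '.'
      val newBest = if (node.terminal && atBoundary) Some(i) else best
      if (i == reversed.length) newBest
      else node.children.get(reversed(i)) match { // one small-map lookup per character
        case Some(child) => walk(child, i + 1, newBest)
        case None        => newBest
      }
    }

    walk(trie, 0, None).map(len => lower.takeRight(len))
  }
}
```

For example, with `val trie = TrieSuffixLookup.build(Seq("com", "uk", "co.uk"))`, looking up "mail.google.co.uk" yields `Some("co.uk")`. The number of map lookups is bounded by the length of the matched suffix plus one, regardless of how many suffixes the Trie holds.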
Memory-wise, the Trie takes about 1.7MB of heap, which isn't great, but that memory should only be consumed for users who call .publicSuffix, and existing users should be unaffected. We can look at options to reduce memory usage if it becomes an issue for anyone.
Based on this, I will cut version 0.4.12 this evening.
0.4.12 has been released with this change.