NET-A-PORTER/scala-uri

Support for tld

Closed this issue · 8 comments

Hi,
I'm looking for a utility to extract the TLD from a URL.
E.g. for url = https://www.google.co.uk, the tld should be uk and the sld should be co.uk. Is there a way to achieve this?

theon commented

Hi @dkjhanitt,

I think this is a good candidate for an enhancement to scala-uri. I have added a first attempt and published it as version 0.4.12-SNAPSHOT for you to try out.

It works by using the list of public suffixes from publicsuffix.org as there is no way to do this algorithmically.

Let me know if 0.4.12-SNAPSHOT works for you. There are a few things I'd like to do before cutting a proper release:

  • Run some micro benchmarks. The current implementation uses a Trie. Is this the most efficient approach? I'd like to compare it against a simple Set of all the public suffixes: call contains with the last dot-separated segment of the hostname, then repeat with each next dot-separated segment prepended until it returns false.
  • If a Trie is the best approach, are there any identical parts of the tree that can be reused? The current Trie is 701 KB encoded as JSON!
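To make the comparison concrete, here is a minimal sketch of the two lookups. It rests on some assumptions: the suffix set is a tiny hand-picked sample rather than the real publicsuffix.org list, the trie is keyed by whole labels for brevity (the actual implementation may be structured differently), and all names (`PublicSuffixSketch`, `viaSet`, `viaTrie`) are hypothetical.

```scala
object PublicSuffixSketch {
  // Tiny sample; the real list from publicsuffix.org has thousands of entries.
  val suffixes: Set[String] = Set("com", "uk", "co.uk", "org.uk")

  // "www.google.co.uk" -> List("uk", "co", "google", "www")
  private def labelsOf(host: String): List[String] = host.split('.').toList.reverse

  // Set approach: test "uk", then "co.uk", then "google.co.uk", ...
  // and keep the last candidate seen before contains returns false.
  def viaSet(host: String): Option[String] = {
    val candidates = labelsOf(host)
      .scanLeft(List.empty[String])((acc, label) => label :: acc)
      .tail
      .map(_.mkString("."))
    candidates.takeWhile(suffixes.contains).lastOption
  }

  // Trie approach: one node per label, walked from the rightmost label inwards.
  final case class Trie(children: Map[String, Trie], isSuffix: Boolean) {
    def insert(labels: List[String]): Trie = labels match {
      case Nil => copy(isSuffix = true)
      case head :: rest =>
        val child = children.getOrElse(head, Trie.empty)
        copy(children = children.updated(head, child.insert(rest)))
    }
  }
  object Trie { val empty: Trie = Trie(Map.empty, isSuffix = false) }

  val trie: Trie = suffixes.foldLeft(Trie.empty)((t, s) => t.insert(labelsOf(s)))

  def viaTrie(host: String): Option[String] = {
    @annotation.tailrec
    def walk(node: Trie, remaining: List[String], seen: List[String],
             best: Option[String]): Option[String] =
      remaining match {
        case head :: rest =>
          node.children.get(head) match {
            case Some(child) =>
              val path = head :: seen
              // Remember the longest path so far that is marked as a suffix.
              val newBest = if (child.isSuffix) Some(path.mkString(".")) else best
              walk(child, rest, path, newBest)
            case None => best
          }
        case Nil => best
      }
    walk(trie, labelsOf(host), Nil, None)
  }
}
```

With this sample data, both `PublicSuffixSketch.viaSet("www.google.co.uk")` and `PublicSuffixSketch.viaTrie("www.google.co.uk")` return `Some("co.uk")`. One difference worth noting: the Set version stops at the first miss, so it only works if every parent of a suffix is also in the set, whereas the trie walk keeps going through intermediate nodes that are not suffixes themselves.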

theon commented

To use the SNAPSHOT version, you will probably have to add a resolver to your SBT build. See: https://github.com/NET-A-PORTER/scala-uri#latest-snapshot-builds
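
For reference, a minimal sketch of what that might look like in build.sbt. The resolver name and URL below are the usual Sonatype OSS snapshots repository, which is an assumption here; the README link above is the authoritative source for the exact setting.

```scala
// build.sbt (sketch; resolver URL is an assumption, see the README)
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

libraryDependencies += "com.netaporter" %% "scala-uri" % "0.4.12-SNAPSHOT"
```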

Hi @theon ,
Thanks for the response. I tried to use the

"com.netaporter" %% "scala-uri" % "0.4.12-SNAPSHOT"

version, but I'm running into the following error:

Exception in thread "main" java.io.FileNotFoundException: src/main/resources/public_suffix_trie.json (No such file or directory)

Alternatively, I came across Google Guava, which also uses the data from publicsuffix.org:

build.sbt dependency:

    "com.google.guava" % "guava" % "16.0",

Usage:

    import com.google.common.net.InternetDomainName

    // These are bare host names rather than full URLs
    val url = "mail.google.com"
    val url1 = "mail.google.co.uk"
    val id = InternetDomainName.from(url)
    val id1 = InternetDomainName.from(url1)
    // Explicit tuples, so the output has the (a, b, c) shape shown below
    println((id.topPrivateDomain(), id.parts(), id.publicSuffix()))
    println((id1.topPrivateDomain(), id1.parts(), id1.publicSuffix()))

Which prints

(google.com, [mail, google, com], com)
(google.co.uk, [mail, google, co, uk], co.uk)

Hi @theon ,
I created a pull request, #109. Please review it and see if you can merge it into master.

theon commented

Hi @dkjhanitt,

Thanks for getting back. The FileNotFoundException should be fixed now for 0.4.12-SNAPSHOT, sorry about that. I'll comment on the PR over there.
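
For anyone hitting something similar: the path in the exception suggests the JSON file was being opened relative to the working directory, which only works from inside the project checkout. Below is a minimal sketch of reading a file from the classpath instead. The names `ResourceLoading` and `readResource` are hypothetical, and this is not necessarily the exact change that went into the snapshot.

```scala
import scala.io.Source

object ResourceLoading {
  /** Reads a classpath resource as a String, or None if it is not on the classpath. */
  def readResource(name: String): Option[String] =
    // getResourceAsStream returns null for a missing resource, hence Option(...)
    Option(getClass.getResourceAsStream(name)).map { in =>
      val src = Source.fromInputStream(in, "UTF-8")
      try src.mkString finally src.close()
    }
}
```

Something like `ResourceLoading.readResource("/public_suffix_trie.json")` would then find the file wherever the jar is used, since anything under src/main/resources ends up at the root of the classpath in a standard SBT build.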

theon commented

I ran some benchmarks and am happy with the runtime characteristics. The ScalaMeter tests come out at about 10 nanoseconds per uri.publicSuffix call. A crappy homemade benchmark comes out at 0 nanoseconds, probably because a single call takes less time than the resolution of System.nanoTime. Calling uri.publicSuffix to get a .com suffix should result in five .get() calls to five small maps (about 0-30 items each), so I guess 10 nanoseconds sounds about right?
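
On the 0 nanoseconds result: timing a single call cannot resolve anything shorter than the clock's granularity, so a homemade benchmark has to time many calls and divide. A minimal sketch of that idea follows. `NanoBench` and `averageNanos` are hypothetical names, and this is only a rough approximation of what a proper harness such as ScalaMeter does (no GC control, no statistics).

```scala
object NanoBench {
  // Results are written here so the JIT is less likely to discard the calls entirely.
  @volatile private var sink: Any = null

  /** Average nanoseconds per evaluation of `op`, over `iterations` timed runs after a warm-up. */
  def averageNanos[A](iterations: Int)(op: => A): Double = {
    require(iterations > 0, "iterations must be positive")
    var i = 0
    // Warm-up pass so the timed pass runs mostly JIT-compiled code.
    while (i < iterations) { sink = op; i += 1 }
    val start = System.nanoTime()
    i = 0
    while (i < iterations) { sink = op; i += 1 }
    val elapsed = System.nanoTime() - start
    elapsed.toDouble / iterations
  }
}
```

For example, `NanoBench.averageNanos(1000000)(Map("com" -> 1).get("com"))` gives a per-call average for a small map lookup, which is the kind of operation the trie performs a handful of times per query.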

Memory-wise, the Trie takes about 1.7 MB of heap, which isn't great, but that memory should only be consumed for users who call .publicSuffix, so existing users should be unaffected. We can look at options to reduce memory usage if it becomes an issue for anyone.

[Image: scala-uri public suffixes heap usage]

Based on this, I will cut version 0.4.12 this evening.

theon commented

0.4.12 has been released with this change.