barseghyanartur/tld

Domains mistakenly taken as TLD

The library recognizes some domains as TLDs and treats their subdomain as the actual domain.

Here are some examples:

from tld import get_tld

get_tld('https://a.b.c.theworkpc.com', fix_protocol=True, as_object=True)
get_tld('https://www.m-o-f-365.oa.r.appspot.com', fix_protocol=True, as_object=True)

Hello @OrEshed, it seems theworkpc.com is a TLD as per this line of the list. oa.r.appspot.com also is one as per this line. But since they are private domains, you can still filter them out by using the search_private kwarg set to True if this is useful for your use case.

search_private set to True is the default, and it didn't work. Any other trick I can use?

I meant to set it to False, sorry :)

NP :)
I tried False but got this back:
TldDomainNotFound: Domain a.b.c.theworkpc.com didn't match any existing TLD name!
Although the domain structure is quite simple and the TLD is obviously .COM

@OrEshed to avoid the raised error you can set fail_silently to True.

> Although the domain structure is quite simple and the TLD is obviously .COM

No, the TLD really is theworkpc.com according to the latest DNS data. The suffix is owned by them.

But I agree that if we ignore private TLDs then raising an error on your case is a bit weird. @barseghyanartur do you think we should adapt the code to return .com in this particular case?

My problem is in obtaining the sub-domains string. I would like to get back a.b.c, rather than a.b. For my use-case, private TLDs do not count but only ICANN TLDs.
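To make the difference concrete, here is a minimal sketch of how the choice of suffix changes the subdomain string. It does not use the tld library at all: `split_host` is a hypothetical helper written only for this illustration, and the suffix is passed in by hand rather than looked up in any list.

```python
from urllib.parse import urlsplit


def split_host(url, suffix):
    """Split the host of `url` into (subdomain, domain, suffix).

    Hypothetical helper for illustration only: the suffix is supplied
    by the caller instead of being looked up in a suffix list.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    suffix_labels = suffix.split(".")
    n = len(suffix_labels)
    # The host must end with the suffix and have at least one more label.
    if len(labels) <= n or labels[-n:] != suffix_labels:
        raise ValueError(f"{host!r} has no domain label under {suffix!r}")
    domain = labels[-n - 1]                # label directly left of the suffix
    subdomain = ".".join(labels[:-n - 1])  # everything left of the domain
    return subdomain, domain, suffix


url = "https://a.b.c.theworkpc.com"
print(split_host(url, "theworkpc.com"))  # ('a.b', 'c', 'theworkpc.com')
print(split_host(url, "com"))            # ('a.b.c', 'theworkpc', 'com')
```

With the private suffix the subdomain is a.b; with the ICANN suffix it is a.b.c, which is the result asked for here.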

> But I agree that if we ignore private TLDs then raising an error on your case is a bit weird. @barseghyanartur do you think we should adapt the code to return .com in this particular case?

Would making a separate BaseMozillaPublicOnlyTLDSourceParser with absolutely no information on private TLDs and then subclassing it as MozillaPublicOnlyTLDSourceParser solve the issue?

get_tld("https://a.b.c.theworkpc.com", parser_class=MozillaPublicOnlyTLDSourceParser)
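As a rough sketch of what "no information on private TLDs" could mean, the snippet below splits a suffix list on the ICANN/private section markers used by the Mozilla public suffix list. This is not the library's parser: `parse_suffixes` and `SAMPLE_LIST` are hypothetical names, and the sample is a tiny hand-written excerpt rather than the real list.

```python
# Tiny hand-written excerpt in the public suffix list format (assumption:
# the real list is far larger and also contains wildcard/exception rules).
SAMPLE_LIST = """\
// ===BEGIN ICANN DOMAINS===
com
co.uk
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
theworkpc.com
appspot.com
// ===END PRIVATE DOMAINS===
"""


def parse_suffixes(text, include_private=True):
    """Return the set of suffixes, optionally skipping the private section."""
    suffixes = set()
    in_private = False
    for raw in text.splitlines():
        line = raw.strip()
        if "===BEGIN PRIVATE DOMAINS===" in line:
            in_private = True
            continue
        if "===END PRIVATE DOMAINS===" in line:
            in_private = False
            continue
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        if in_private and not include_private:
            continue  # a public-only parser never sees these entries
        suffixes.add(line)
    return suffixes


print(sorted(parse_suffixes(SAMPLE_LIST, include_private=False)))
print(sorted(parse_suffixes(SAMPLE_LIST)))
```

A public-only structure built this way cannot match theworkpc.com as a suffix at all, so the lookup falls through to com without any special handling.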

That could work, but we could also fix the trie traversal algorithm to handle this case (although I am afraid it requires backtracking unless we keep specialized leaf information for the public-only and private-only cases).
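For illustration, one way to keep a single trie and still answer public-only queries is to flag each leaf as private or not and remember the deepest acceptable leaf while walking the labels, which avoids backtracking. This is only a sketch under those assumptions, not the library's actual traversal; `build_trie` and `find_suffix` are hypothetical names and wildcard rules are ignored.

```python
def build_trie(entries):
    """Build a trie from (suffix, is_private) pairs, labels stored right to left."""
    root = {}
    for suffix, is_private in entries:
        node = root
        for label in reversed(suffix.split(".")):
            node = node.setdefault("children", {}).setdefault(label, {})
        node["leaf"] = True
        node["private"] = is_private
    return root


def find_suffix(host, trie, search_private=True):
    """Return the longest matching suffix of `host`, or None if nothing matches.

    Private leaves are skipped when search_private is False, but the walk
    continues, so a shorter public match found earlier is still returned.
    """
    labels = host.split(".")
    node = trie
    best = None  # number of labels in the best acceptable match so far
    for depth, label in enumerate(reversed(labels), start=1):
        node = node.get("children", {}).get(label)
        if node is None:
            break
        if node.get("leaf") and (search_private or not node["private"]):
            best = depth
    if best is None:
        return None
    return ".".join(labels[-best:])


trie = build_trie([("com", False), ("theworkpc.com", True)])
print(find_suffix("a.b.c.theworkpc.com", trie))                        # theworkpc.com
print(find_suffix("a.b.c.theworkpc.com", trie, search_private=False))  # com
```

The cost is one extra flag per leaf instead of a second trie, at the price of a slightly more involved lookup.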

@Yomguithereal:

Works in branch feature/separate-public-only-tlds.

In [1]: from tld import get_tld

In [2]: url = 'https://a.b.c.theworkpc.com'

In [3]: get_tld(url, search_private=False)
Out[3]: 'com'

In [4]: get_tld(url)
Out[4]: 'theworkpc.com'

The only downside of separate public-only TLDs is that they use more RAM, and I still need to run speed tests comparing the master branch with this branch for that specific case.

But, I'm all for your solution as well. Do you think it will introduce performance issues?

@Yomguithereal:

It seems to work pretty well and performance remains almost the same. This, however, creates a tiny backwards incompatibility (in terms of behaviour), but I'll mention that in the changelog.

> This, however, creates a tiny backwards incompatibility (in terms of behaviour)

Indeed, but since the old behaviour can be seen as a bug, I guess this is not a deal breaker?

@Yomguithereal:

Nope. The "fix" to restore the old behaviour would be to explicitly pass parser_class=MozillaTLDSourceParser, as shown in the updated tests.

Fixed in 0.12.3.