Domains mistakenly taken as TLD

Question

Closed this issue 4 years ago · 14 comments

The library recognizes some domains as TLDs and treats their subdomains as the actual domain.

Here are some examples:

from tld import get_tld

get_tld('https://a.b.c.theworkpc.com', fix_protocol=True, as_object=True)
get_tld('https://www.m-o-f-365.oa.r.appspot.com', fix_protocol=True, as_object=True)

Answer 1 · 2020-11-23T16:02:06.000Z

Hello @OrEshed, it seems theworkpc.com is a TLD as per this line of the list. oa.r.appspot.com is also one, as per this line. But since they are private domains, you can still filter them out by setting the search_private kwarg to True, if that is useful for your use case.

Answer 2 · 2020-11-23T16:26:50.000Z

search_private set to True is the default, and it didn't work. Any other trick I can use?

Answer 3 · 2020-11-23T16:28:48.000Z

I meant to set it to False, sorry :)

Answer 4 · 2020-11-23T16:30:31.000Z

NP :)
I tried False but got this back:
TldDomainNotFound: Domain a.b.c.theworkpc.com didn't match any existing TLD name!
Although the domain structure is quite simple and the TLD is obviously .COM

Answer 5 · 2020-11-23T16:50:19.000Z

@OrEshed to avoid the raised error you can set fail_silently to True.

> Although the domain structure is quite simple and the TLD is obviously .COM

No, the TLD really is theworkpc.com according to the latest suffix list data: the domain's owners have it listed there as a private suffix.

Answer 6 · 2020-11-23T17:55:15.000Z

But I agree that if we ignore private TLDs then raising an error on your case is a bit weird. @barseghyanartur do you think we should adapt the code to return .com in this particular case?

Answer 7 · 2020-11-24T10:53:29.000Z

My problem is in obtaining the sub-domains string. I would like to get back a.b.c, rather than a.b. For my use-case, private TLDs do not count but only ICANN TLDs.
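To make the difference concrete, here is a minimal stdlib-only sketch of longest-suffix matching. It is not the library's code: `split_host` and the two tiny suffix sets are hypothetical stand-ins for the real list, chosen only to show why the subdomain comes back as a.b in one mode and a.b.c in the other.

```python
# Toy suffix sets (assumptions for illustration, not the real list).
ICANN_SUFFIXES = {"com"}
PRIVATE_SUFFIXES = {"theworkpc.com"}


def split_host(host, include_private=True):
    """Return (subdomain, domain, suffix), or None if nothing matches."""
    if include_private:
        suffixes = ICANN_SUFFIXES | PRIVATE_SUFFIXES
    else:
        suffixes = ICANN_SUFFIXES
    labels = host.lower().split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The host itself is a suffix, so there is no domain part.
                return None
            return ".".join(labels[:i - 1]), labels[i - 1], candidate
    return None


with_private = split_host("a.b.c.theworkpc.com")
icann_only = split_host("a.b.c.theworkpc.com", include_private=False)
```

With the private entry included, the longest match is theworkpc.com, so c becomes the domain and a.b the subdomain. With ICANN entries only, com matches and the subdomain is a.b.c.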

Answer 8 · 2020-11-24T11:09:55.000Z

> But I agree that if we ignore private TLDs then raising an error on your case is a bit weird. @barseghyanartur do you think we should adapt the code to return .com in this particular case?

Would making a separate BaseMozillaPublicOnlyTLDSourceParser with absolutely no information on private TLDs and then subclassing it as MozillaPublicOnlyTLDSourceParser solve the issue?

get_tld("https://a.b.c.theworkpc.com", parser_class=MozillaPublicOnlyTLDSourceParser)
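For readers wondering how a parser could carry "absolutely no information on private TLDs": the Public Suffix List wraps its private entries in marker comments, so a loader can skip that section. The sketch below only illustrates that idea and is not the code from the proposed parser classes. `parse_suffixes` is a hypothetical helper, and `SAMPLE_LIST` is a tiny hand-written excerpt in the list's format, not real data.

```python
# Hand-written excerpt in the Public Suffix List format (illustration only).
SAMPLE_LIST = """\
// ===BEGIN ICANN DOMAINS===
com
co.uk
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
theworkpc.com
// ===END PRIVATE DOMAINS===
"""


def parse_suffixes(text, include_private=True):
    """Collect suffix entries, optionally skipping the private section."""
    suffixes = set()
    in_private = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("// ===BEGIN PRIVATE DOMAINS"):
            in_private = True
            continue
        if line.startswith("// ===END PRIVATE DOMAINS"):
            in_private = False
            continue
        # Skip blank lines and ordinary comments.
        if not line or line.startswith("//"):
            continue
        if in_private and not include_private:
            continue
        suffixes.add(line)
    return suffixes


all_suffixes = parse_suffixes(SAMPLE_LIST)
public_only = parse_suffixes(SAMPLE_LIST, include_private=False)
```

A structure built from `public_only` never sees theworkpc.com, so lookups can only resolve to com. The cost is that both structures must be held in memory if both modes are used.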

Answer 9 · 2020-11-24T11:20:22.000Z

That could work, but we could also fix the trie traversal algorithm to handle this case. I am afraid that requires backtracking, though, unless we keep specialized leaf information for the public-only and private-only cases.
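A rough sketch of what such a traversal could look like, under stated assumptions: each trie node carries a leaf flag and a private flag, and the walk remembers the deepest acceptable leaf instead of physically backtracking. The names `build_trie` and `match_suffix` and the node layout are hypothetical, not the library's internals.

```python
def _new_node():
    return {"children": {}, "leaf": False, "private": False}


def build_trie(entries):
    """Build a trie keyed by labels in reverse order (com -> theworkpc)."""
    root = _new_node()
    for suffix, is_private in entries:
        node = root
        for label in reversed(suffix.split(".")):
            node = node["children"].setdefault(label, _new_node())
        node["leaf"] = True
        node["private"] = is_private
    return root


def match_suffix(root, host, search_private=True):
    """Return the longest acceptable suffix of host, or None."""
    labels = host.lower().split(".")
    node = root
    best = None  # number of labels in the deepest acceptable match so far
    for depth, label in enumerate(reversed(labels), start=1):
        node = node["children"].get(label)
        if node is None:
            break
        # Private leaves only count when private suffixes are wanted.
        if node["leaf"] and (search_private or not node["private"]):
            best = depth
    if best is None:
        return None
    return ".".join(labels[-best:])


trie = build_trie([("com", False), ("theworkpc.com", True)])
with_private = match_suffix(trie, "a.b.c.theworkpc.com")
public_only = match_suffix(trie, "a.b.c.theworkpc.com", search_private=False)
```

Keeping the last acceptable depth means the walk never has to step back up the trie: when it ends on a private leaf that should be ignored, the earlier public match (com here) is already recorded.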

Answer 10 · 2020-11-24T11:42:33.000Z

@Yomguithereal:

Works in branch feature/separate-public-only-tlds.

In [1]: from tld import get_tld

In [2]: url = 'https://a.b.c.theworkpc.com'

In [3]: get_tld(url, search_private=False)
Out[3]: 'com'

In [4]: get_tld(url)
Out[4]: 'theworkpc.com'

The only downside of separate public-only TLDs is that they use more RAM, and I still need to run speed tests comparing the master branch with this branch for that specific case.

But, I'm all for your solution as well. Do you think it will introduce performance issues?

Answer 11 · 2020-11-25T22:50:41.000Z

@Yomguithereal:

It seems to work pretty well and performance remains almost the same. This, however, creates a tiny backwards incompatibility (in terms of behaviour), but I'll mention that in the changelog.

Answer 12 · 2020-11-25T23:08:07.000Z

> This, however, creates a tiny backwards incompatibility (in terms of behaviour)

Indeed, but since it can be seen as a bug, I guess this is not a deal breaker?

Answer 13 · 2020-11-25T23:15:54.000Z

@Yomguithereal:

Nope. The "fix" to get the old behaviour would be to pass parser_class=MozillaTLDSourceParser explicitly, as shown in the updated tests.

Answer 14 · 2020-11-26T22:40:20.000Z

Fixed in 0.12.3.