Domains mistakenly taken as TLD
Closed this issue · 14 comments
The library recognizes some domains as TLDs and treats their subdomain as the actual domain.
Here are some examples:
from tld import get_tld
get_tld('https://a.b.c.theworkpc.com', fix_protocol=True, as_object=True)
get_tld('https://www.m-o-f-365.oa.r.appspot.com', fix_protocol=True, as_object=True)
search_private set to True is the default, and it didn't work. Is there any other trick I can use?
I meant to set it to False, sorry :)
NP :)
I tried False but got this back:
TldDomainNotFound: Domain a.b.c.theworkpc.com didn't match any existing TLD name!
This happens even though the domain structure is quite simple and the TLD is obviously .com.
But I agree that if we ignore private TLDs, then raising an error in your case is a bit weird. @barseghyanartur do you think we should adapt the code to return .com in this particular case?
My problem is in obtaining the subdomain string. I would like to get back a.b.c rather than a.b. For my use case, only ICANN TLDs count, not private ones.
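To make the difference concrete, here is a minimal sketch of longest-suffix matching against a toy suffix list. It does not use the tld library: split_host, ICANN_SUFFIXES and PRIVATE_SUFFIXES are hypothetical names, and the two sets are tiny stand-ins for the real public suffix list.

```python
from typing import FrozenSet, Tuple

# Toy stand-ins for the two sections of a public suffix list (assumption).
ICANN_SUFFIXES: FrozenSet[str] = frozenset({"com"})
PRIVATE_SUFFIXES: FrozenSet[str] = frozenset({"theworkpc.com"})


def split_host(host: str, suffixes: FrozenSet[str]) -> Tuple[str, str, str]:
    """Return (subdomain, domain, suffix) using the longest matching suffix."""
    labels = host.split(".")
    # Start at i=1 so that at least one label is left over for the domain;
    # smaller i means a longer candidate suffix, so the first hit is the longest.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[: i - 1]), labels[i - 1], candidate
    raise ValueError(f"no known suffix in {host!r}")


host = "a.b.c.theworkpc.com"
with_private = split_host(host, ICANN_SUFFIXES | PRIVATE_SUFFIXES)
icann_only = split_host(host, ICANN_SUFFIXES)
print(with_private)  # ('a.b', 'c', 'theworkpc.com')
print(icann_only)    # ('a.b.c', 'theworkpc', 'com')
```

With the private entries included, the subdomain comes back as a.b. With only the ICANN entries, it is a.b.c, which is the result asked for here.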
Would making a separate BaseMozillaPublicOnlyTLDSourceParser with absolutely no information on private TLDs and then subclassing it as MozillaPublicOnlyTLDSourceParser solve the issue?
get_tld("https://a.b.c.theworkpc.com", parser_class=MozillaPublicOnlyTLDSourceParser)
That could work, but we could also fix the trie traversal algorithm to handle this case (although I'm afraid it requires backtracking if we don't keep specialized leaf information for the public-only and private-only cases).
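As a rough illustration of the "specialized leaf information" idea, the sketch below keeps separate public and private flags on each trie node, so a single pass can remember the longest acceptable match without backtracking. This is not the library's actual implementation: Node, add_suffix and longest_suffix are hypothetical names, the suffix entries are toy data, and wildcard and exception rules are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    public_leaf: bool = False   # a public (ICANN) suffix ends at this node
    private_leaf: bool = False  # a private suffix ends at this node


def add_suffix(root: Node, suffix: str, private: bool) -> None:
    """Insert a suffix into the trie, keyed by labels from right to left."""
    node = root
    for label in reversed(suffix.split(".")):
        node = node.children.setdefault(label, Node())
    if private:
        node.private_leaf = True
    else:
        node.public_leaf = True


def longest_suffix(root: Node, host: str, search_private: bool = True) -> Optional[str]:
    """Walk the trie once, remembering the deepest node that counts as a match."""
    labels = host.split(".")
    node: Optional[Node] = root
    best = None  # number of labels in the longest acceptable suffix so far
    for depth, label in enumerate(reversed(labels), start=1):
        node = node.children.get(label)
        if node is None:
            break
        # Private leaves only count when search_private is enabled.
        if node.public_leaf or (search_private and node.private_leaf):
            best = depth
    return None if best is None else ".".join(labels[-best:])


root = Node()
add_suffix(root, "com", private=False)
add_suffix(root, "theworkpc.com", private=True)  # toy private entry

print(longest_suffix(root, "a.b.c.theworkpc.com"))                        # theworkpc.com
print(longest_suffix(root, "a.b.c.theworkpc.com", search_private=False))  # com
```

Because the walk records the last public match on the way down, ignoring private entries falls back to com instead of failing, at the cost of one extra flag per node.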
Works in branch feature/separate-public-only-tlds.
In [1]: from tld import get_tld
In [2]: url = 'https://a.b.c.theworkpc.com'
In [3]: get_tld(url, search_private=False)
Out[3]: 'com'
In [4]: get_tld(url)
Out[4]: 'theworkpc.com'
The only downside of separate public-only TLDs is that it uses more RAM, and I still need to run speed tests comparing the master branch with this branch for that specific case.
But, I'm all for your solution as well. Do you think it will introduce performance issues?
It seems to work pretty well and performance remains almost the same. This, however, creates a tiny backwards incompatibility (in terms of behaviour), but I'll mention that in the changelog.
Indeed, but since it can be seen as a bug, I guess this is not a deal breaker?
Nope. The "fix" to get the old behaviour would be to explicitly pass parser_class=MozillaTLDSourceParser, as shown in the updated tests.
Fixed in 0.12.3.