Wrong extraction for some valid domains
The following domains all work, yet `www` is recognized as the domain instead of a subdomain, and the actual domain is wrongly prepended to the suffix. Although it is clear that `www` is not a domain but rather a special subdomain, this doesn't yet negatively impact the `registered_domain` and `fqdn` properties.
Python 3.6.8 (default, Jan 19 2019, 21:26:02)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tldextract
>>> tldextract.__version__
'2.2.1'
>>> tldextract.extract('www.experts-comptables.fr')
ExtractResult(subdomain='', domain='www', suffix='experts-comptables.fr')
>>> tldextract.extract('www.gob.mx')
ExtractResult(subdomain='', domain='www', suffix='gob.mx')
>>> tldextract.extract('www.ma.gov.br')
ExtractResult(subdomain='', domain='www', suffix='ma.gov.br')
>>> tldextract.extract('www.wroclaw.pl')
ExtractResult(subdomain='', domain='www', suffix='wroclaw.pl')
But when passing the same domains without `www`, the whole input ends up in the suffix, and both the `registered_domain` and `fqdn` properties wrongly return an empty string.
I've read up on #138 and some more related issues in this repo, but I'm pretty sure this can't be intended behaviour?
Closing this issue: these are cases of (at first glance unexpected) top-level domains, so `www` is indeed the domain here, not a (special) subdomain as in 99% of URLs.
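
For anyone landing here later, the behaviour can be illustrated with a minimal sketch of longest-suffix matching. This is not tldextract's actual implementation: `SUFFIXES`, `Parts`, and `split_host` are hypothetical names, and the suffix set is a tiny hard-coded stand-in for the real suffix list, containing only the four suffixes from the output above.

```python
from typing import NamedTuple

# Assumption: a hard-coded stand-in for the real suffix list, holding
# only the four suffixes that appear in the output reported above.
SUFFIXES = {"experts-comptables.fr", "gob.mx", "ma.gov.br", "wroclaw.pl"}


class Parts(NamedTuple):
    subdomain: str
    domain: str
    suffix: str

    @property
    def registered_domain(self) -> str:
        # Empty when no label is left over to act as the domain.
        if self.domain and self.suffix:
            return f"{self.domain}.{self.suffix}"
        return ""


def split_host(host: str) -> Parts:
    """Split a hostname by matching the longest known suffix first."""
    labels = host.lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            # The label just left of the suffix is the domain, if any.
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return Parts(subdomain, domain, candidate)
    # No known suffix: treat the last label as the domain.
    return Parts(".".join(labels[:-1]), labels[-1], "")


print(split_host("www.gob.mx"))
print(repr(split_host("gob.mx").registered_domain))
```

Because `gob.mx` as a whole matches a suffix, the only label left for the domain is `www`; without `www`, nothing is left at all, which is why the registered domain comes back empty.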