john-kurkowski/tldextract

Wrong extraction for some valid domains

Closed · 1 comment

The following domains all resolve, yet www is recognized as the domain instead of the subdomain, and the actual domain is wrongly folded into the suffix. Although www is clearly not a domain but a special subdomain, this doesn't yet negatively affect the registered_domain and fqdn properties.

Python 3.6.8 (default, Jan 19 2019, 21:26:02)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tldextract

>>> tldextract.__version__
'2.2.1'

>>> tldextract.extract('www.experts-comptables.fr')
ExtractResult(subdomain='', domain='www', suffix='experts-comptables.fr')

>>> tldextract.extract('www.gob.mx')
ExtractResult(subdomain='', domain='www', suffix='gob.mx')

>>> tldextract.extract('www.ma.gov.br')
ExtractResult(subdomain='', domain='www', suffix='ma.gov.br')

>>> tldextract.extract('www.wroclaw.pl')
ExtractResult(subdomain='', domain='www', suffix='wroclaw.pl')

But when I pass the same domains without www, the whole input ends up in the suffix, the domain is empty, and both the registered_domain and fqdn properties wrongly return an empty string.

I've read up on #138 and some other related issues in this repo, but I'm pretty sure this can't be intended behaviour?

Closing this issue: these are cases of (at first glance unexpected) public suffixes, i.e. effective top level domains. For these inputs www really is the domain, not a (special) subdomain as in 99% of URLs.
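
For anyone landing here later, this is a minimal sketch of why the results above come out the way they do. It is not tldextract's actual implementation: SUFFIXES is a tiny hand-picked stand-in for the real suffix list, split_host and registered_domain are hypothetical helper names, and wildcard and exception rules are ignored entirely.

```python
from typing import NamedTuple

# Hand-picked stand-in for a public suffix list (assumption for illustration;
# the real list is far larger and also has wildcard and exception rules).
SUFFIXES = {
    "fr", "experts-comptables.fr",
    "mx", "gob.mx",
    "br", "gov.br", "ma.gov.br",
    "pl", "wroclaw.pl",
}


class Parts(NamedTuple):
    subdomain: str
    domain: str
    suffix: str


def split_host(host: str) -> Parts:
    """Split a hostname using longest-suffix matching (hypothetical helper)."""
    labels = host.lower().split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            # The label directly left of the suffix is the domain, if any.
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return Parts(subdomain, domain, candidate)
    # No known suffix: treat the last label as the domain.
    return Parts(".".join(labels[:-1]), labels[-1], "")


def registered_domain(parts: Parts) -> str:
    """Return domain + suffix, or an empty string if either piece is missing."""
    if parts.domain and parts.suffix:
        return f"{parts.domain}.{parts.suffix}"
    return ""


if __name__ == "__main__":
    for host in ("www.gob.mx", "gob.mx", "www.example.fr"):
        parts = split_host(host)
        print(host, parts, repr(registered_domain(parts)))
```

Because gob.mx is itself listed as a suffix, the longest match consumes it, leaving www as the only label that can be the domain. Passing gob.mx on its own leaves no label for the domain at all, which is why the registered domain comes back empty.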