Incorrect extraction of domain
Good morning,
Description:
I use tldextract, and today I found what looks like a bug while extracting a URL. Samples are provided below.
Version Tested:
Successfully installed requests-file-1.5.1 tldextract-3.4.0
Samples:
tldextract.extract('http://vic.gov.au/')
tldextract.extract('http://www.vic.gov.au/')
Execution:
ExtractResult(subdomain='', domain='', suffix='vic.gov.au')
ExtractResult(subdomain='', domain='www', suffix='vic.gov.au')
However, vic should be the domain in both cases. As shown in the Public Suffix List, gov.au is a valid 2LD.
> As shown in the Public Suffix List, gov.au is a valid 2LD.
You're looking at this line, referencing gov.au, declaring it a public suffix, right? Look a few lines down in that same hunk. vic.gov.au is also a public suffix.
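To illustrate why both samples come out the way they do, here is a minimal sketch of longest-suffix matching. It is not tldextract's actual implementation: `RULES` is a tiny hand-picked subset of the list, `split_host` is a hypothetical helper, and wildcard and exception rules are ignored.

```python
# Tiny illustrative subset of rules; the real Public Suffix List is far larger.
RULES = {"au", "gov.au", "vic.gov.au"}


def split_host(host, rules=RULES):
    """Split a hostname into (subdomain, domain, suffix) by longest matching rule."""
    labels = host.lower().rstrip(".").split(".")
    # Fallback when no rule matches: the last label alone is the suffix.
    cut = len(labels) - 1
    # Candidates run from the whole hostname down to the last label,
    # so the first hit is the longest matching suffix.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in rules:
            cut = i
            break
    suffix = ".".join(labels[cut:])
    domain = labels[cut - 1] if cut >= 1 else ""
    subdomain = ".".join(labels[: cut - 1]) if cut >= 2 else ""
    return subdomain, domain, suffix


print(split_host("vic.gov.au"))          # ('', '', 'vic.gov.au')
print(split_host("www.vic.gov.au"))      # ('', 'www', 'vic.gov.au')
print(split_host("www.example.gov.au"))  # ('www', 'example', 'gov.au')
```

Because vic.gov.au is itself a rule, it wins over the shorter gov.au, which leaves no label for the domain in the first sample and only www in the second.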
Hi @john-kurkowski,
Yes, you are right; I did not notice that.
It is still somewhat confusing, though, because any 3LD could also be a registrable domain name.
For example, catholic.edu.au is a 3LD, yet it could also be a domain, as with vic.gov.au, unless there is a rule I do not know about.
Good morning @john-kurkowski,
The Public Suffix List contains krakow.pl.
tldextract does not extract it as a 2LD suffix but as a domain plus public suffix. Is that correct?
Registered Domain: krakow.pl | Domain: krakow | FQDN: www.cm-uj.krakow.pl | Suffix: pl
If yes, why?
PS: The same happens with other suffixes in the list, e.g. ras.ru and url.tw.
Hi @aimtsou,
The suffixes krakow.pl, ras.ru, and url.tw appear after the line // ===BEGIN PRIVATE DOMAINS=== and are considered private domains, which are excluded from extraction by default.
Hence, they are treated differently from vic.gov.au, which appears before // ===BEGIN PRIVATE DOMAINS===.
To include private domain extraction, refer to https://github.com/john-kurkowski/tldextract#public-vs-private-domains.
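As a rough sketch of how the two sections change the result, the snippet below keeps the rules in two sets and only consults the private set when asked. This is an illustration, not the library's code: `ICANN_RULES`, `PRIVATE_RULES`, `extract`, and the `include_private` flag are hypothetical names, the rule sets are a small hand-picked sample placed according to the explanation above, and the real option is described in the linked README.

```python
# Hand-picked sample rules; section placement follows the explanation above.
ICANN_RULES = {"pl", "ru", "tw", "au", "gov.au", "vic.gov.au"}
PRIVATE_RULES = {"krakow.pl", "ras.ru", "url.tw"}


def extract(host, include_private=False):
    """Return (subdomain, domain, suffix); private rules apply only on request."""
    rules = (ICANN_RULES | PRIVATE_RULES) if include_private else ICANN_RULES
    labels = host.lower().rstrip(".").split(".")
    # Index of the first label of the longest matching suffix;
    # fall back to the last label if nothing matches.
    cut = next(
        (i for i in range(len(labels)) if ".".join(labels[i:]) in rules),
        len(labels) - 1,
    )
    subdomain = ".".join(labels[: max(cut - 1, 0)])
    domain = labels[cut - 1] if cut >= 1 else ""
    suffix = ".".join(labels[cut:])
    return subdomain, domain, suffix


print(extract("www.cm-uj.krakow.pl"))
# ('www.cm-uj', 'krakow', 'pl')
print(extract("www.cm-uj.krakow.pl", include_private=True))
# ('www', 'cm-uj', 'krakow.pl')
```

The default call reproduces the split reported above (domain krakow, suffix pl), while opting in to the private rules moves krakow.pl into the suffix.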
> It is still somewhat confusing, though, because any 3LD could also be a registrable domain name.
> For example, catholic.edu.au is a 3LD, yet it could also be a domain, as with vic.gov.au, unless there is a rule I do not know about.
I can see why that would be confusing, but that is the point of the Public Suffix List (and this library wrapping it), to know the rules/inventory of possible public suffixes, so you don't have to. I think this issue is the library working as intended.