john-kurkowski/tldextract

Incorrect extraction of domain

Closed this issue · 5 comments

Good morning,

Description:

I use tldextract, but today I found a bug while extracting a URL. Samples are provided below.

Version Tested:

Successfully installed requests-file-1.5.1 tldextract-3.4.0

Samples:

import tldextract
tldextract.extract('http://vic.gov.au/')
tldextract.extract('http://www.vic.gov.au/')

Execution:

ExtractResult(subdomain='', domain='', suffix='vic.gov.au')
ExtractResult(subdomain='', domain='www', suffix='vic.gov.au')

However, vic should be the domain in both cases. As shown in publicsuffixlist, gov.au is a valid 2LD.

> As shown in publicsuffixlist gov.au is a valid 2LD.

You're looking at this line, referencing gov.au, declaring it a public suffix, right? Look a few lines down in that same hunk. vic.gov.au is also a public suffix.
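To make the "longest matching rule wins" behavior concrete, here is a minimal sketch of a suffix lookup. It is not tldextract's actual implementation: `RULES` is a hand-picked excerpt rather than the real list, `split_host` is a hypothetical helper, and wildcard and exception rules are ignored.

```python
from __future__ import annotations

# Hand-picked excerpt of public suffix rules (assumption: the real list is far larger).
RULES = {"au", "com.au", "edu.au", "catholic.edu.au", "gov.au", "vic.gov.au"}


def split_host(host: str) -> tuple[str, str, str]:
    """Split a hostname into (subdomain, domain, suffix) using the longest matching rule."""
    labels = host.lower().rstrip(".").split(".")
    last = len(labels) - 1
    # Try the longest candidate first, so vic.gov.au is found before gov.au.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        # If nothing matches, fall back to treating the last label as the suffix.
        if candidate in RULES or i == last:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    return "", "", ""  # only reached for an empty label list


print(split_host("vic.gov.au"))         # the whole host is itself a suffix
print(split_host("www.vic.gov.au"))     # "www" becomes the domain
print(split_host("www.health.gov.au"))  # here gov.au is the longest match
```

Because vic.gov.au is itself a rule, nothing is left over for the domain in the first sample, and www is the only label left in the second. That matches the output reported above.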

Hi @john-kurkowski,

Yes, you are right, I did not notice that.
It is still kind of confusing, though, because any 3LD could also be a possible domain name.
For example, catholic.edu.au is a 3LD, although it could be a domain too, as with vic.gov.au, unless there is a rule I do not know about.

Good morning @john-kurkowski,

In the publicsuffix list: krakow.pl

In tldextract, it is not extracted as a 2LD but as a domain plus public suffix. Is that correct?
Registered Domain: krakow.pl | Domain: krakow | FQDN: www.cm-uj.krakow.pl | Suffix: pl

If yes, why?

PS: The same happens with other suffixes in the list, e.g. ras.ru and url.tw.

Hi @aimtsou,

The suffixes krakow.pl, ras.ru, and url.tw appear after the line // ===BEGIN PRIVATE DOMAINS=== and are considered private domains, which are excluded from extraction by default.

Hence, they are treated differently from vic.gov.au, which appears before // ===BEGIN PRIVATE DOMAINS===.

To include private domain extraction, refer to https://github.com/john-kurkowski/tldextract#public-vs-private-domains.
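As a rough illustration of the public/private split, the sketch below parses a tiny hand-written excerpt in the Public Suffix List format and only consults the private section when asked to. This is an assumption-laden sketch, not tldextract's code: `PSL_EXCERPT`, `parse_rules`, and `suffix_of` are hypothetical names, and the placement of the three private entries simply follows the comment above.

```python
# Hand-written excerpt in Public Suffix List format (assumption: not the real file).
PSL_EXCERPT = """\
// ===BEGIN ICANN DOMAINS===
au
gov.au
vic.gov.au
pl
ru
tw
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
krakow.pl
ras.ru
url.tw
// ===END PRIVATE DOMAINS===
"""


def parse_rules(text):
    """Return (icann_rules, private_rules) parsed from PSL-formatted text."""
    icann, private = set(), set()
    current = icann
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("// ===BEGIN PRIVATE DOMAINS==="):
            current = private  # every rule after this marker is a private one
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        current.add(line)
    return icann, private


def suffix_of(host, include_private=False):
    """Return the longest matching suffix, optionally honoring private rules."""
    icann, private = parse_rules(PSL_EXCERPT)
    rules = (icann | private) if include_private else icann
    labels = host.lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
    return labels[-1]  # fall back to the last label


print(suffix_of("www.cm-uj.krakow.pl"))                        # private rules ignored
print(suffix_of("www.cm-uj.krakow.pl", include_private=True))  # private rules honored
```

In tldextract itself, the corresponding switch is the `include_psl_private_domains` option described in the README section linked above.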

> Although it is kind of confusing now because all 3LD could be possible domain names also.
> IE: catholic.edu.au is a 3LD although it can be a domain too as seen with vic.gov.au besides there is a rule I do not know.

I can see why that would be confusing, but that is the point of the Public Suffix List (and this library wrapping it), to know the rules/inventory of possible public suffixes, so you don't have to. I think this issue is the library working as intended.