john-kurkowski/tldextract

bad parsing

Closed this issue · 3 comments

Hi,
I found two issues while parsing domains.

tldextract.extract("hokkaido.jp")
ExtractResult(subdomain='', domain='', suffix='hokkaido.jp')

tldextract.extract("ketrzyn.pl")
ExtractResult(subdomain='', domain='', suffix='ketrzyn.pl')

Having the same issue with ne.jp

Not sure if it's relevant, but ne.jp is actually incorrect; it should be www.ne.jp. I'm working with a legacy system that strips www from URLs. When run on www.ne.jp it works, but that causes other bugs for me.

These are all suffixes on the Public Suffix List. That's why they are returned like this.

https://publicsuffix.org/list/public_suffix_list.dat

@erpatrik, @ShmuelTreiger is correct: the domains you're testing are in the Public Suffix List, so this library is correctly returning them as suffixes.
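
To illustrate why an input that is itself a public suffix comes back with an empty domain, here is a minimal sketch of longest-suffix matching. It is not tldextract's actual implementation: `SUFFIXES`, `Parts`, and `split_host` are hypothetical names, the suffix set is a tiny hand-picked subset of the real list, and wildcard and exception rules are ignored.

```python
from typing import NamedTuple

# Assumption: a tiny hand-picked subset of the Public Suffix List,
# for illustration only. The real list is far larger.
SUFFIXES = {"jp", "pl", "hokkaido.jp", "ketrzyn.pl", "ne.jp"}


class Parts(NamedTuple):
    subdomain: str
    domain: str
    suffix: str


def split_host(host: str) -> Parts:
    """Split a hostname using the longest matching suffix (simplified)."""
    labels = host.lower().strip(".").split(".")
    # Try the longest candidate first: the whole host, then drop
    # one leading label at a time.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            suffix, rest = candidate, labels[:i]
            break
    else:
        # No known suffix matched; leave the suffix empty.
        suffix, rest = "", labels
    # The domain is the label just left of the suffix, if there is one.
    domain = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    return Parts(subdomain, domain, suffix)


print(split_host("hokkaido.jp"))  # Parts(subdomain='', domain='', suffix='hokkaido.jp')
print(split_host("www.ne.jp"))    # Parts(subdomain='', domain='www', suffix='ne.jp')
```

In this sketch, when the whole input matches a suffix, no label is left over to act as the domain, which mirrors the results reported above. With `www.ne.jp`, the `www` label sits to the left of the suffix and is therefore reported as the domain.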